Abstract This paper presents a middle-level video representation
named Video Primal Sketch (VPS), which integrates two regimes of
models: i) a sparse coding model using static or moving primitives to
explicitly represent moving corners, lines, feature points, etc., and
ii) a FRAME/MRF model reproducing feature statistics extracted from
the input video to implicitly represent textured motion, such as water
and fire. The feature statistics include histograms of spatio-temporal
filters and velocity distributions. This paper makes three
contributions to the literature: i) learning a dictionary of video
primitives using parametric generative models; ii) proposing the
Spatio-Temporal FRAME (ST-FRAME) and Motion-Appearance FRAME
(MA-FRAME) models for modeling and synthesizing textured motion; and
iii) developing a parsimonious hybrid model for generic video
representation. Given an input video, VPS selects the proper models
automatically for different motion patterns and is compatible with
high-level action representations. In the experiments, we synthesize a
number of textured motions; reconstruct real videos using the VPS;
report a series of human perception experiments to verify the quality
of reconstructed videos; demonstrate how the VPS changes over the
scale transition in videos; and present the close connection between
VPS and high-level action models.

Keywords Middle-level vision • Video representation • Textured
motion • Dynamic texture synthesis • Primal sketch

Fig. 1 The four types of local video patches characterized by two
criteria: sketchability and trackability.


1 Introduction 


1.1 Motivation 


Videos of natural scenes contain a vast variety of motion patterns.
We divide these motion patterns into a 2x2 table based on their
complexities measured by two criteria: i) sketchability (Guo et al
(2007)), i.e. whether a local patch can be represented explicitly by
an image primitive from a sparse coding dictionary, and ii)
intrackability (or trackability) (Gong and Zhu (2012)), which measures
the uncertainty of tracking an image patch using the entropy of the
posterior probability on velocities. Fig. 1 shows some examples of the
different video patches in the four categories. Category A consists of
the simplest vision phenomena, i.e. sketchable and trackable motions,
such as trackable corners, lines, and feature points, whose positions
and shapes can be tracked between frames. For example, patches (a),
(b), (c) and (d) belong to category A. Category D is the most complex
and is called textured motion or dynamic texture in the literature,
such as water, fire or grass, in which the images have no distinct
primitives or trackable motion, such as patches (h) and (i). The other
categories are in between. Category B refers to sketchable but
intrackable patches, which can be described by distinct image
primitives but can hardly be tracked between frames due to fast
motion, for example the patches (e) and (f) at the legs of the
galloping horse. Finally, category C includes the trackable but
non-sketchable patches, which are cluttered features or moving
kernels, e.g. patch (g).
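The 2x2 table described above can be written down as a small lookup. This is only an illustrative sketch: the function name `patch_category` and the use of two booleans (rather than the continuous sketchability and intrackability scores of the cited papers) are assumptions made here for clarity.

```python
def patch_category(sketchable: bool, trackable: bool) -> str:
    """Map the two criteria to the category letters A-D used in the text."""
    if sketchable and trackable:
        return "A"  # e.g. trackable corners, lines, feature points
    if sketchable:
        return "B"  # distinct primitive, but motion too fast to track
    if trackable:
        return "C"  # cluttered features or moving kernels
    return "D"      # textured motion such as water, fire or grass


if __name__ == "__main__":
    for sketchable in (True, False):
        for trackable in (True, False):
            print(sketchable, trackable, patch_category(sketchable, trackable))
```

In practice both criteria are graded quantities, so the booleans would come from thresholding the corresponding maps.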

In the vision literature, as was pointed out by (Shi and Zhu (2007)),
there are two families of representations, which code images or videos
by explicit and implicit functions respectively.

1. Explicit representations with generative models. (Olshausen
(2003); Kim et al (2010)) learned an overcomplete set of coding
elements from natural video sequences using the sparse coding model
(Olshausen and Field (1996)). (Elder and Zucker (1998)) and (Guo et al
(2007)) represented the image/video patches by fitting functions with
explicit geometric and photometric parameters. (Wang and Zhu (2004))
synthesized complex motion, such as birds, snowflakes, and waves, with
a large amount of particles and wave components. (Black and Fleet
(2000)) represented two types of motion primitives, namely smooth
motion and motion boundaries, for motion segmentation. In higher level
object motion tracking, people represented different tracking units
depending on the underlying objects and scales, such as sparse or
dense feature points tracking (Serby et al (2004), Black and Fleet
(2000)), kernels tracking (Comaniciu et al (2003); Fan et al (2006)),
contours tracking (Maccormick and Blake (2000)), and middle-level
pairwise-components generation (Yuan et al (2010)).

2. Implicit representations with descriptive models. For textured
motions or dynamic textures, people used numerous Markov models which
are constrained to reproduce some statistics extracted from the input
video. For example, dynamic textures (Szummer and Picard (1996);
Campbell et al (2002)) were modeled by a spatio-temporal
auto-regressive (STAR) model, in which the intensity of each pixel was
represented by a linear summation of intensities of its spatial and
temporal neighbors. (Bouthemy et al (2006)) proposed mixed-state
auto-models for motion textures by generalizing the auto-models in
(Besag (1974)). (Doretto et al (2003)) derived an auto-regression
moving-average model for dynamic texture. (Chan and Vasconcelos
(2008)) and (Ravichandran et al (2009)) extended it to a stable linear
dynamical system (LDS) model.

Recently, to represent complex motion, such as human activities,
researchers have used Histogram of Oriented Gradients (HOG) (Dalal and
Triggs (2005)) for appearance and Histogram of Oriented Optical-Flow
(HOOF) (Dalal et al (2006); Chaudhry et al (2009)) for motion. The HOG
and HOOF record the rough geometric information through the grids and
pool the statistics (histograms) within the local cells. Such features
are used for recognition in discriminative tasks, such as action
classification, and are not suitable for video coding and
reconstruction.

In the literature, these video representations are often manually
selected for specific videos in different tasks. What is lacking is a
generic representation and criterion that can automatically select the
proper models for different patterns of the video. Furthermore, as it
was demonstrated in (Gong and Zhu (2012)) that both sketchability and
trackability change over scales, densities, and stochasticity of the
dynamics, a good video representation must adapt itself continuously
in a long video sequence.


1.2 Overview and contributions 

Motivated by the above observations, we study a unified middle-level
representation, called video primal sketch (VPS), by integrating the
two families of representations. Our work is inspired by Marr's
conjecture for a generic "token" representation called primal sketch
as the output of early vision (Marr (1982)), and is aimed at extending
the primal sketch model proposed by (Guo et al (2007)) from images to
videos. Our goal is not only to provide a parsimonious model for video
compression and coding, but more importantly, to support and be
compatible with high-level tasks such as motion tracking and action
recognition.

Fig. 2 overviews an example of the video primal sketch. Fig. 2(a) is
an input video frame which is separated into sketchable and
non-sketchable regions by the sketchability map in (b), and trackable
primitives and intrackable regions by the trackability map in (c). The
sketchable or trackable regions are explicitly represented by a sparse
coding model and reconstructed in (d) with motion primitives, and each
non-sketchable and intrackable region has a textured motion which is
synthesized in (e) by a generalized FRAME (Zhu et al (1998)) model
(implicit and descriptive). The synthesis of this frame is shown in
(f), which integrates the results from (d) and (e) seamlessly.

Fig. 2 An example of Video Primal Sketch. (a) An input frame. (b)
Sketchability map where dark means sketchable. (c) Trackability map
where darker means higher trackability. (d) Reconstruction of explicit
regions using primitives. (e) Synthesis for implicit regions (textured
motions) by sampling the generalized FRAME model through Markov chain
Monte Carlo using the explicit regions as boundary condition. (f)
Synthesized frame by combining the explicit and implicit
representations.


As Table 1 shows, the explicit representations include 3,600
parameters for the positions, types, motion velocities, etc. of the
video primitives, and the implicit representations have 420 parameters
for the histograms of a set of filter responses on dynamic textures.
Compared with the 288 × 352 pixels of each frame, these counts show
the efficiency of the VPS model.
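The parameter counts quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the paper (Table 1 and the explanation in Section 2.5: 11 profile parameters plus 1 type parameter per primitive, three textured regions described by 11, 12 and 5 filters, 15 bins per histogram); the variable names are chosen here for illustration.

```python
# Figures taken from Table 1 and Section 2.5 of the paper.
total_pixels = 288 * 352            # video resolution
explicit_pixels = 31_644            # pixels covered by explicit regions
n_primitives = 300
params_per_primitive = 11 + 1       # 11 profile values + 1 type index

filters_per_region = [11, 12, 5]    # filters for the 3 textured motion regions
bins_per_histogram = 15

explicit_params = n_primitives * params_per_primitive
implicit_params = sum(filters_per_region) * bins_per_histogram

explicit_pixel_fraction = explicit_pixels / total_pixels
explicit_param_fraction = explicit_params / total_pixels

print(explicit_params, implicit_params)
print(round(100 * explicit_pixel_fraction, 1), round(100 * explicit_param_fraction, 1))
```

The two fractions come out at roughly 31% and 3.6%, in line with the "≈ 30%" and "≈ 3.6%" entries of Table 1.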

This paper makes the following contributions to the 
literature. 


1. We present and compare two different but related models to define
   textured motions. The first one is a spatio-temporal FRAME
   (ST-FRAME) model, which is a non-parametric Markov random field and
   generalizes the FRAME model (Zhu et al (1998)) of texture with
   spatio-temporal filters. The ST-FRAME model is learned so that it
   has marginal probabilities that match the histograms of the
   responses from the spatio-temporal filters on the input video. The
   second one is a motion-appearance FRAME model (MA-FRAME), which not
   only matches the histograms of some spatio-temporal filter
   responses, but also matches the histograms of velocities pooled
   over a local region. The MA-FRAME model achieves better results in
   video synthesis than the ST-FRAME model, and it is, to some extent,
   similar to the HOOF features used in action classification (Dalal
   et al (2006); Chaudhry et al (2009)).
2. We learn a dictionary of motion primitives from input videos using
   a generative sparse coding model. These primitives are used to
   reconstruct the explicit regions and include two types: i) generic
   primitives for the sketchable patches, such as corners, bars etc.;
   and ii) specific primitives for the non-sketchable but trackable
   patches, which are usually texture patches similar to those used in
   kernel tracking (Comaniciu et al (2003)).
3. The models for implicit and explicit regions are integrated in a
   hybrid representation, the video primal sketch (VPS), as a generic
   middle-level representation of video. We will also show how VPS
   changes over information scales affected by distance, density and
   dynamics.
4. We show the connections between this middle-level VPS
   representation and features for high-level vision tasks such as
   action recognition.


Table 1 The parameters in video primal sketch model for the water
bird video in Fig. 2

Video Resolution       288 × 352 pixels
Explicit Region        31,644 pixels ≈ 30%
Primitive Number       300
Primitive Width        11 pixels
Explicit Parameters    3,600 ≈ 3.6%
Implicit Parameters    420


Our work is inspired by Gong's empirical study in Gong and Zhu
(2012), which revealed the statistical properties of videos over scale
transitions and defined intrackability as the entropy of local
velocities. When the entropy is high, the patch cannot be tracked
locally and thus its motion is represented by a velocity histogram.
Gong and Zhu (2012) did not give a unified model for video
representation and synthesis, which is the focus of the current paper.

This paper extends a previous conference paper (Han et al (2011)) in
the following aspects:


1. We propose a new dynamic texture model, MA-FRAME, for better
   representing velocity information. Benefiting from the new temporal
   feature, the VPS model can be applied to high-level action
   representation tasks more directly.
2. We compare spatial and temporal features with HOG (Dalal and Triggs
   (2005)) and HOOF (Dalal et al (2006)) and discuss the connections
   between them.
3. We do a series of perceptual experiments to verify the high quality
   of video synthesis from the aspect of human perception.


Zhi Han et al.


The remainder of this paper is organized as follows. 
In Section 2, we present the framework of video pri¬ 
mal sketch. In Section 3, we explain the algorithms for 
explicit representation, textured motion synthesis and 
video synthesis, and show a series of experiments. The 
paper is concluded with a discussion in Section 4. 


2 Video primal sketch model 


In his monumental book (Marr (1982)), Marr conjectured a primal
sketch as the output of early vision that transfers the continuous
"analog" signals in pixels to a discrete "token" representation. The
latter should be parsimonious and sufficient to reconstruct the
observed image without much perceivable distortion. A mathematical
model was later studied by (Guo et al (2007)), which successfully
modeled hundreds of images by integrating sketchable structures and
non-sketchable textures. In this section, we extend it to video primal
sketch as a hybrid generic video representation.

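Let $\mathbf{I}[1,m] = \{\mathbf{I}_{(t)}\}_{t=1}^{m}$ be a video
defined on a 3D lattice $\Lambda \subset Z^3$. $\Lambda$ is divided
disjointly into explicit and implicit regions,

$$ \Lambda = \Lambda_{ex} \cup \Lambda_{im}, \quad \Lambda_{ex} \cap \Lambda_{im} = \emptyset. \tag{1} $$

Then the video $\mathbf{I}$ is decomposed as two components

$$ \mathbf{I}_{\Lambda} = (\mathbf{I}_{\Lambda_{ex}}, \mathbf{I}_{\Lambda_{im}}). \tag{2} $$

$\mathbf{I}_{\Lambda_{ex}}$ are defined by explicit functions
$\mathbf{I} = g(w)$, in which each instance corresponds to a different
functional form of $g()$ and is indexed by a particular value of the
parameter $w$. $\mathbf{I}_{\Lambda_{im}}$ are defined by implicit
functions $H(\mathbf{I}) = h$, in which $H()$ extracts the statistics
of filter responses from image $\mathbf{I}$ and $h$ is a specific
value of the histograms.

In the following, we first present the two families of models for
$\mathbf{I}_{\Lambda_{ex}}$ and $\mathbf{I}_{\Lambda_{im}}$
respectively, and then integrate them in the VPS model.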



Fig. 3 Comparison between sketchable and trackable regions (panels:
Sketchable Region; Trackable Region).


2.1 Explicit representation by sparse coding 

The explicit region $\Lambda_{ex}$ of a video $\mathbf{I}$ is
decomposed into $n_{ex}$ disjoint domains (usually $n_{ex}$ is in the
order of $10^2$),

$$ \Lambda_{ex} = \bigcup_{i=1}^{n_{ex}} \Lambda_{ex,i}. \tag{3} $$

Here $\Lambda_{ex,i} \subset \Lambda$ defines the domain of a "brick".
A brick, denoted by $\mathbf{I}_{\Lambda_{ex,i}}$, is a
spatio-temporal volume like a patch in images. These bricks are
divided into the three categories A, B and C as we mentioned in
Section 1.1.

The size of $\Lambda_{ex,i}$ influences the results of tracking and
synthesis to some degree. The spatial size should depend on the scale
of structures or the granularity of textures, and the temporal size
should depend on the motion amplitude and frequency in the time
dimension, which are hard to estimate in real applications. However, a
general size works well for most cases, say 11 × 11 pixels × 3 frames
for trackable bricks (sketchable or non-sketchable), or 11 × 11
pixels × 1 frame for sketchable but intrackable bricks. Therefore, in
all the experiments of this paper, the size of $\Lambda_{ex,i}$ is
chosen as such.

Fig. 3 shows one example comparing the sketchable and trackable
regions based on the sketchability and trackability maps shown in
Fig. 2(b) and (c) respectively. It is worth noting that the two
regions largely overlap: only a small percentage of the regions is
sketchable but not trackable, or trackable but not sketchable.

Each brick can be represented by a primitive $B_i \in \Delta_B$
through an explicit function,

$$ \mathbf{I}_{\Lambda_{ex,i}}(x,y,t) = \alpha_i B_i(x,y,t) + \epsilon, \quad \forall (x,y,t) \in \Lambda_{ex,i}. \tag{4} $$

$B_i$ means the $i$th primitive from the primitive dictionary
$\Delta_B$, which fits the brick $\mathbf{I}_{\Lambda_{ex,i}}$ best.
Here $i$ indexes the parameters such as type, position, orientation
and scale of $B_i$. $\alpha_i$ is the corresponding coefficient, and
$\epsilon$ represents the residue, which is assumed to be i.i.d.
Gaussian. For a trackable primitive, $B_i(x,y,t)$ includes 3 frames
and thus encodes the velocity $(u,v)$ in the 3 frames. For a
sketchable but intrackable primitive, $B_i(x,y,t)$ has only 1 frame.

Fig. 4 Some selected examples of primitives. $\Delta_B$ is a
dictionary of primitives with velocities $(u,v)$ ($(u,v)$ is not
shown), such as blobs, ridges, edges and special primitives.
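To make the single-primitive fit in (4) concrete, the sketch below fits one coefficient per candidate primitive by least squares and keeps the candidate with the smallest residual energy. It is a minimal illustration, assuming bricks and primitives are flattened to equal-length lists; the function names and the toy two-entry dictionary are hypothetical and are not the paper's implementation (Section 3.2 describes a faster filter-based procedure).

```python
def fit_coefficient(brick, primitive):
    """Least-squares coefficient alpha minimising sum (I - alpha * B)^2."""
    denom = sum(b * b for b in primitive)
    if denom == 0:
        raise ValueError("primitive has zero energy")
    return sum(i * b for i, b in zip(brick, primitive)) / denom


def residual_energy(brick, primitive, alpha):
    """Sum of squared residues left after subtracting alpha * B."""
    return sum((i - alpha * b) ** 2 for i, b in zip(brick, primitive))


def best_primitive(brick, dictionary):
    """Return (name, alpha, residual) of the best-fitting single primitive."""
    best = None
    for name, primitive in dictionary.items():
        alpha = fit_coefficient(brick, primitive)
        err = residual_energy(brick, primitive, alpha)
        if best is None or err < best[2]:
            best = (name, alpha, err)
    return best


if __name__ == "__main__":
    # Toy 1D "brick" and a hypothetical two-primitive dictionary.
    brick = [0.0, 2.0, 4.0, 2.0, 0.0]
    dictionary = {
        "ridge": [0.0, 1.0, 2.0, 1.0, 0.0],
        "edge": [-1.0, -1.0, 0.0, 1.0, 1.0],
    }
    print(best_primitive(brick, dictionary))
```

Under the i.i.d. Gaussian residue assumption, the least-squares coefficient is also the maximum-likelihood one, which is why it is used here.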

As Fig. 4 illustrates, the dictionary $\Delta_B$ is composed of two
categories:

- Common primitives $\Delta_B^{common}$. These are primitives shared
  by most videos, such as blobs, edges and ridges etc. They have
  explicit parameters for orientations and scales. They mostly belong
  to the sketchable region as shown in Fig. 3.

- Special primitives $\Delta_B^{special}$. These bricks do not have a
  common appearance and are limited to specific video frames. They are
  non-sketchable but trackable, and are recorded to code the specific
  video region. They mostly belong to the trackable region but are not
  included in the sketchable region as shown in Fig. 3.

Note that the primitives and categories shown in Fig. 4 are some
selected examples, not the whole dictionary. The details for learning
these primitives are introduced in Section 3.2.

Equation (4) uses only one base function and thus is different from
the conventional linear additive model. Following the Gaussian
assumption for the residues, we have the following probabilistic model
for the explicit region

$$ p(\mathbf{I}_{\Lambda_{ex}}; \mathbf{B}, \alpha) = \prod_{i=1}^{n_{ex}} \frac{1}{(2\pi)^{n/2} \sigma_i^{n}} \exp\{-E_i\}, \tag{5} $$

$$ E_i = \sum_{(x,y,t) \in \Lambda_{ex,i}} \frac{(\mathbf{I}(x,y,t) - \alpha_i B_i(x,y,t))^2}{2\sigma_i^2}, $$

where $\mathbf{B} = (B_1, \ldots, B_{n_{ex}})$ represents the selected
primitive set, $n$ is the size of each primitive, $n_{ex}$ is the
number of selected primitives and $\sigma_i$ is the estimated standard
deviation of representing natural videos by $B_i$.
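As a short worked step (a sketch derived here from the Gaussian model above, not an equation taken from the paper), taking the negative logarithm of the explicit-region likelihood turns the product over bricks into a sum of per-brick costs. The symbols are those of the model: $n$ is the primitive size, $\sigma_i$ the residue standard deviation and $E_i$ the scaled residual energy.

```latex
\begin{align*}
-\log p(\mathbf{I}_{\Lambda_{ex}}; \mathbf{B}, \alpha)
  &= \sum_{i=1}^{n_{ex}} \left[ \log\!\left( (2\pi)^{n/2} \sigma_i^{n} \right) + E_i \right] \\
  &= \sum_{i=1}^{n_{ex}} \left[ \frac{n}{2} \log\!\left( 2\pi\sigma_i^{2} \right) + E_i \right].
\end{align*}
```

Each bracketed term has the form of a coding length for one brick, which is the quantity compared against the implicit model when the representation is selected in Section 2.6.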


2.2 Implicit representations by FRAME models 


The implicit region $\Lambda_{im}$ of video $\mathbf{I}$ is segmented
into $n_{im}$ (usually $n_{im}$ is no more than 10) disjoint
homogeneous textured motion regions,

$$ \Lambda_{im} = \bigcup_{j=1}^{n_{im}} \Lambda_{im,j}. \tag{6} $$


One effective approach for texture modeling is to pool the histograms
for a set of filters (Gabor, DoG and DooG) on the input image (Bergen
and Adelson (1991); Chubb and Landy (1991); Heeger and Bergen (1995);
Zhu et al (1998); Portilla and Simoncelli (2000)). Since Gabor filters
model the response functions of the neurons in the primary visual
cortex, two texture images with the same histograms of filter
responses generate the same texture impression, and thus are
considered perceptually equivalent (Silverman et al (1989)). The FRAME
model proposed in (Zhu et al (1998)) generates the expected marginal
statistics to match the observed histograms through the maximum
entropy principle. As a result, any images drawn from this model will
have the same filtered histograms and thus can be used for synthesis
or reconstruction.

We extend this concept to video by adding temporal constraints and
define each homogeneous textured motion region
$\mathbf{I}_{\Lambda_{im,j}}$ by an equivalence class of videos,

$$ \Omega_K(\mathbf{h}_j) = \{ \mathbf{I}_{\Lambda_{im,j}} : H_k(\mathbf{I}_{\Lambda_{im,j}}) = h_{k,j}, \; k = 1, \ldots, K \}, \tag{7} $$

where $\mathbf{h}_j = (h_{1,j}, \ldots, h_{K,j})$ is a series of 1D
histograms of filtered responses that characterize the macroscopic
properties of the textured motion pattern. Thus we only need to code
the histograms $\mathbf{h}_j$ and synthesize the textured motion
region $\mathbf{I}_{\Lambda_{im,j}}$ by sampling from the set
$\Omega_K(\mathbf{h}_j)$. As $\mathbf{I}_{\Lambda_{im,j}}$ is defined
by the implicit functions, we call it an implicit representation.
These regions are coded up to an equivalence class, in contrast to
reconstructing the pixel intensities in the explicit representation.

To capture temporal constraints, one straightforward method is to
choose a set of spatio-temporal filters and calculate the histograms
of the filter responses. This leads to the spatio-temporal FRAME
(ST-FRAME) model which will be introduced in Section 2.3. Another
method is to compute the statistics of velocity. Since the motion in
these regions is intrackable, at each point of the image, its velocity
is ambiguous (large entropy). We pool the histograms of velocities
locally in a way similar to the HOOF (Histogram of Oriented
Optical-Flow) (Dalal et al (2006); Chaudhry et al (2009)) features in
action classification. This leads to the motion-appearance FRAME
(MA-FRAME) model which uses histograms of both appearance (static
filters) and velocities. We will elaborate on this model in
Section 2.4.

Fig. 5 $\Delta_F$ is a dictionary of spatio-temporal filters
including static, motion and flicker filters.


2.3 Implicit representation by spatio-temporal FRAME


ST-FRAME is an extension of the FRAME model (Zhu et al (1998)) by
adopting spatio-temporal filters.

A set of filters $\mathbf{F}$ is selected from a filter bank
$\Delta_F$. Fig. 5 illustrates the three types of filters in
$\Delta_F$: i) the static filters for texture appearance in a single
image; ii) the motion filters with a certain velocity; and iii) the
flicker filters that have zero velocity but opposite signs between
adjacent frames. For each filter $F_k \in \mathbf{F}$, the
spatio-temporal filter response of $\mathbf{I}$ at
$(x,y,t) \in \Lambda_{im,j}$ is $F_k * \mathbf{I}(x,y,t)$. The
convolution is over the spatial and temporal domain. By pooling the
filter responses over all $(x,y,t) \in \Lambda_{im,j}$, we obtain a
number of 1D histograms

$$ H_k(z; \mathbf{I}_{\Lambda_{im,j}}) = \frac{1}{|\Lambda_{im,j}|} \sum_{(x,y,t) \in \Lambda_{im,j}} \delta(z; F_k * \mathbf{I}(x,y,t)), \quad k = 1, \ldots, K, \tag{8} $$

where $z$ indexes the histogram bins, and $\delta(z; x) = 1$ if $x$
belongs to bin $z$, and $\delta(z; x) = 0$ otherwise. Following the
FRAME model, the statistical model of textured motion
$\mathbf{I}_{\Lambda_{im,j}}$ is written in the form of the following
Gibbs distribution,

$$ p(\mathbf{I}_{\Lambda_{im,j}}; \mathbf{F}, \beta) \propto \exp\Big\{ -\sum_{k} \langle \beta_{k,j}, H_k(\mathbf{I}_{\Lambda_{im,j}}) \rangle \Big\}, \tag{9} $$

where the $\beta_{k,j}$ are potential functions.
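The pooling of filter responses into a normalized 1D histogram, and the inner-product energy that appears in the exponent of the Gibbs distribution, can be sketched as follows. This is an illustrative sketch only: the responses are assumed to be already computed and flattened into a list, the equal-width binning over a fixed range `[lo, hi]` with clamping of out-of-range values is an assumption made here, and the function names are hypothetical.

```python
def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin containing value; out-of-range values are clamped."""
    if value <= lo:
        return 0
    if value >= hi:
        return n_bins - 1
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


def response_histogram(responses, lo, hi, n_bins):
    """Normalized histogram: fraction of filter responses falling in each bin."""
    if not responses:
        raise ValueError("no responses to pool")
    counts = [0] * n_bins
    for r in responses:
        counts[bin_index(r, lo, hi, n_bins)] += 1
    return [c / len(responses) for c in counts]


def gibbs_energy(betas, histograms):
    """Sum over filters k of the inner product <beta_k, H_k>."""
    return sum(
        b * h
        for beta_k, hist_k in zip(betas, histograms)
        for b, h in zip(beta_k, hist_k)
    )


if __name__ == "__main__":
    responses = [-0.9, -0.1, 0.1, 0.2, 0.95, 2.0]   # toy filter responses
    hist = response_histogram(responses, lo=-1.0, hi=1.0, n_bins=4)
    print(hist)
    print(gibbs_energy([[1.0, 0.0, 0.0, 0.0]], [hist]))
```

Because each histogram sums to one, the potentials only matter up to an additive constant per filter, which is absorbed by the normalizing constant of the distribution.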

According to the theorem of ensemble equivalence (Wu et al (2000)),
the Gibbs distribution converges to the uniform distribution over the
set $\Omega_K(\mathbf{h}_j)$ in (7) when $\Lambda_{im,j}$ is large
enough. For any fixed local brick $\Lambda_0 \subset \Lambda_{im,j}$,
the distribution of $\mathbf{I}_{\Lambda_0}$ follows the Markov random
field model (9). The model can describe textured motion located in an
irregularly shaped region $\Lambda_{im,j}$.

The filters in $\mathbf{F}$ are pursued one by one from the filter
bank $\Delta_F$ so that the information gain is maximized at each
step,

$$ F_k^{*} = \arg\max_{F_k \in \Delta_F} \| H_k^{before} - H_k^{after} \|, \tag{10} $$

where $H_k^{before}$ and $H_k^{after}$ are the response histograms of
$F_k$ before and after synthesizing $\mathbf{I}_{\Lambda_{im,j}}$ by
adding $F_k$, respectively. The larger the difference, the more
important the filter is.

Following the distribution form of (9), the probabilistic model of the
implicit parts of $\mathbf{I}$ is defined as

$$ p(\mathbf{I}_{\Lambda_{im}}; \mathbf{F}, \beta) \propto \prod_{j=1}^{n_{im}} p(\mathbf{I}_{\Lambda_{im,j}}; \mathbf{F}, \beta), \tag{11} $$

where $\mathbf{F} = (F_1, \ldots, F_K)$ represents the selected
spatio-temporal filter set.

In the experiments described later, we demonstrate that this model can
synthesize a range of dynamic textures by matching the histograms of
filter responses. The synthesis is done through sampling the
probability by Markov chain Monte Carlo.
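One step of the greedy filter pursuit can be sketched as below. The sketch assumes, as in the original FRAME filter pursuit, that the two histograms compared for each candidate filter are the one measured on the input video and the one measured on the current synthesis; the L1 distance, the function names and the toy histograms are assumptions for illustration, and the re-synthesis that would follow each selection is not shown.

```python
def l1_distance(h1, h2):
    """L1 distance between two histograms with the same bins."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def pursue_next_filter(observed, synthesized, selected):
    """Pick the unselected filter whose histograms differ the most.

    observed / synthesized: dict mapping filter name -> histogram (list of floats).
    selected: set of filter names already chosen in earlier steps.
    Returns (name, gain), or (None, 0.0) if every filter is already selected.
    """
    best_name, best_gain = None, 0.0
    for name in observed:
        if name in selected:
            continue
        gain = l1_distance(observed[name], synthesized[name])
        if best_name is None or gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


if __name__ == "__main__":
    # Hypothetical two-bin histograms for three candidate filters.
    observed = {"static": [0.5, 0.5], "motion": [0.9, 0.1], "flicker": [0.2, 0.8]}
    synthesized = {"static": [0.5, 0.5], "motion": [0.6, 0.4], "flicker": [0.3, 0.7]}
    print(pursue_next_filter(observed, synthesized, selected=set()))
    print(pursue_next_filter(observed, synthesized, selected={"motion"}))
```

A full pursuit loop would re-sample the textured region after each selection, recompute the synthesized histograms, and stop once the largest remaining difference falls below a tolerance.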


2.4 Implicit representation by motion-appearance FRAME


Different from ST-FRAME, in which temporal constraints are based on
spatio-temporal filters, the MA-FRAME model uses the statistics of
velocities in addition to the statistics of filter responses for
appearance.

For the appearance constraints, the filter response histograms are
obtained similarly to ST-FRAME in (8),

$$ H_k(z; \mathbf{I}_{\Lambda_{im,j}}) = \frac{1}{|\Lambda_{im,j}|} \sum_{(x,y,t) \in \Lambda_{im,j}} \delta(z; F_k * \mathbf{I}(x,y,t)), \quad k = 1, \ldots, K, \tag{12} $$

where the filter set $\mathbf{F}$ includes static and flicker filters
in $\Delta_F$.

For the motion constraints, the velocity distribution of each local
patch is estimated via the calculation of trackability (Gong and Zhu
(2012)), in which each patch is compared with its spatial neighborhood
in the adjacent frame and the probability of the local velocity
$v = (v_x, v_y)$ is computed as

$$ p(v \mid \mathbf{I}(x,y,t-1), \mathbf{I}(x,y,t)) \propto \exp\Big\{ -\frac{\| \mathbf{I}(x - v_x, y - v_y, t-1) - \mathbf{I}(x,y,t) \|^2}{2\sigma^2} \Big\}, \tag{13} $$

where $\mathbf{I}(x,y,t)$ here denotes the local patch centered at
$(x,y,t)$. Here, $\sigma$ is the standard deviation of the differences
between local patches from adjacent frames based on various
velocities. The statistical information of velocities for a certain
area of texture is approximated by averaging the velocity distribution
over the region $\Lambda_{im,j}$,

$$ H_v(\mathbf{V}_{\Lambda_{im,j}}) = \frac{1}{|\Lambda_{im,j}|} \sum_{(x,y,t) \in \Lambda_{im,j}} p(v \mid \mathbf{I}(x,y,t-1), \mathbf{I}(x,y,t)). \tag{14} $$

Let $\mathbf{H}(\mathbf{I}_{\Lambda_{im,j}}, \mathbf{V}_{\Lambda_{im,j}}) = (H_1, \ldots, H_K, H_v)$
collect the filter response and velocity histograms of the video. The
statistical model of textured motion $\mathbf{I}_{\Lambda_{im,j}}$ can
be written in the form of the following joint Gibbs distribution,

$$ p(\mathbf{I}_{\Lambda_{im,j}}, \mathbf{V}_{\Lambda_{im,j}}; \mathbf{F}, \beta) \propto \exp\{ -\langle \beta, \mathbf{H}(\mathbf{I}_{\Lambda_{im,j}}, \mathbf{V}_{\Lambda_{im,j}}) \rangle \}. \tag{15} $$

Here, $\beta$ is the parameter of the model.

In summary, the probabilistic model for the implicit regions of
$\mathbf{I}$ is defined as

$$ p(\mathbf{I}_{\Lambda_{im}}; \mathbf{F}, \beta) \propto \prod_{j=1}^{n_{im}} p(\mathbf{I}_{\Lambda_{im,j}}; \mathbf{F}, \beta), \tag{16} $$

where $\mathbf{F} = (F_1, \ldots, F_K)$ represents the selected filter
set.

In the experiment section, we show the effectiveness of the MA-FRAME
model and its advantages over the ST-FRAME model.


2.5 Hybrid model for video representation


The ST or MA-FRAME models for the implicit regions
$\mathbf{I}_{\Lambda_{im}}$ use the explicit regions
$\mathbf{I}_{\Lambda_{ex}}$ as boundary conditions, and the
probabilistic models for $\mathbf{I}_{\Lambda_{ex}}$ and
$\mathbf{I}_{\Lambda_{im}}$ are given by (5) and (11) respectively,

$$ \mathbf{I}_{\Lambda_{ex}} \sim p(\mathbf{I}_{\Lambda_{ex}}; \mathbf{B}, \alpha), \quad \mathbf{I}_{\Lambda_{im}} \sim p(\mathbf{I}_{\Lambda_{im}} \mid \mathbf{I}_{\partial\Lambda_{im}}; \mathbf{F}, \beta). \tag{17} $$

Here, $\mathbf{I}_{\partial\Lambda_{im}}$ represents the boundary
condition of $\mathbf{I}_{\Lambda_{im}}$, which belongs to the
reconstruction of $\mathbf{I}_{\Lambda_{ex}}$. It leads to seamless
boundaries in the synthesis.

By integrating the explicit and implicit representations, the video
primal sketch has the following probability model,

$$ p(\mathbf{I} \mid \mathbf{B}, \mathbf{F}, \alpha, \beta) = \frac{1}{Z} \exp\Big\{ -\sum_{i=1}^{n_{ex}} \sum_{(x,y,t) \in \Lambda_{ex,i}} \frac{(\mathbf{I}(x,y,t) - \alpha_i B_i(x,y,t))^2}{2\sigma_i^2} - \sum_{j=1}^{n_{im}} \sum_{k=1}^{K} \langle \beta_{k,j}, H_k(\mathbf{I}_{\Lambda_{im,j}} \mid \mathbf{I}_{\partial\Lambda_{im,j}}) \rangle \Big\}, \tag{18} $$

where $Z$ is the normalizing constant.

We denote by $VPS = (\mathbf{B}, \mathbf{H})$ the representation for
the video $\mathbf{I}_{\Lambda}$, where
$\mathbf{H} = (\{h_{k,1}\}_{k=1}^{K}, \ldots, \{h_{k,n_{im}}\}_{k=1}^{K})$
includes the histograms described by $\mathbf{F}$ and $\mathbf{V}$,
and $\mathbf{B}$ includes all the primitives with parameters for their
indexes, positions, orientations and scales etc.
$p(VPS) = p(\mathbf{B}, \mathbf{H}) = p(\mathbf{B}) p(\mathbf{H})$
gives the prior probability of video representation by VPS.
$p(\mathbf{B}) \propto \exp\{-|\mathbf{B}|\}$, in which
$|\mathbf{B}|$ is the number of primitives.
$p(\mathbf{H}) \propto \exp\{-\gamma_{tex}(\mathbf{H})\}$, in which
$\gamma_{tex}(\mathbf{H})$ is the energy term; for instance,
$\gamma_{tex}(\mathbf{H}) = \rho n_{im}$ penalizes the number of
implicit regions. Thus, the best video representation $VPS^{*}$ is
obtained by maximizing the posterior probability,

$$ VPS^{*} = \arg\max p(VPS \mid \mathbf{I}_{\Lambda}) = \arg\max p(\mathbf{I}_{\Lambda} \mid VPS) \, p(VPS) $$

$$ = \arg\max_{\mathbf{B}, \mathbf{H}} \frac{1}{Z} \exp\Big\{ -\sum_{i=1}^{n_{ex}} \sum_{(x,y,t) \in \Lambda_{ex,i}} \frac{(\mathbf{I}(x,y,t) - \alpha_i B_i(x,y,t))^2}{2\sigma_i^2} - |\mathbf{B}| - \sum_{j=1}^{n_{im}} \sum_{k=1}^{K} \langle \beta_{k,j}, h_{k,j} \rangle - \gamma_{tex}(\mathbf{H}) \Big\}, \tag{19} $$

following the video primal sketch model in (18).

Table 1 shows an example of VPS. For a video of the size of 288 × 352
pixels, about 30% of the pixels are represented explicitly by
$n_{ex} = 300$ motion primitives. As each primitive needs 11
parameters (the side length of the patch according to the primitive
learning process in Section 3.2) to record the profile and 1 more to
record the type, the total number of parameters for the explicit
representation is 3,600. $n_{im} = 3$ textured motion regions are
represented implicitly by the histograms, which are described by
$K_1 = 11$, $K_2 = 12$ and $K_3 = 5$ filters respectively. As each
histogram has 15 bins, the number of parameters for the implicit
representation is 420.


2.6 Sketchability and Trackability for Model Selection


The computation of the VPS involves the partition of the domain
$\Lambda$ into the explicit regions $\Lambda_{ex}$ and implicit
regions $\Lambda_{im}$. This is done through the sketchability and
trackability maps. In this subsection, we overview the general ideas
and refer to previous work on sketchability (Guo et al (2007)) and
trackability (Gong and Zhu (2012)) for details.

Let us consider one local volume $\Lambda_0 \subset \Lambda$ of the
video $\mathbf{I}$. In the video primal sketch model,
$\mathbf{I}_{\Lambda_0}$ may be modeled either by the sparse coding
model in (5) or by the FRAME model in (11). The choice is determined
via the competition between the two models, i.e. comparing which model
gives the shorter coding length (Shi and Zhu (2007)) for
representation.
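The competition between the two models can be sketched numerically using the two coding-length estimates that this subsection goes on to describe: a Gaussian coding length for the sparse coding model and a sequential-reduction estimate for the FRAME model. The sketch below is illustrative only; it works in nats rather than bits, the variance floor and the information-gain values are hypothetical, and the function names are not from the paper.

```python
import math


def explicit_coding_length(residuals, variance_floor=1e-6):
    """Gaussian coding length (n/2)(log(2*pi*sigma^2) + 1), sigma^2 estimated from the residues."""
    n = len(residuals)
    variance = max(sum(r * r for r in residuals) / n, variance_floor)
    return 0.5 * n * (math.log(2.0 * math.pi * variance) + 1.0)


def implicit_coding_length(n_pixels, information_gains, levels=256):
    """Uniform coding length over all videos, reduced by the gain of each added constraint."""
    return n_pixels * math.log(levels) - sum(information_gains)


def choose_model(residuals, information_gains):
    """Return 'explicit' if the sparse coding model gives the shorter code, else 'implicit'."""
    l_ex = explicit_coding_length(residuals)
    l_im = implicit_coding_length(len(residuals), information_gains)
    return "explicit" if l_ex < l_im else "implicit"


if __name__ == "__main__":
    gains = [3.0, 2.0]  # hypothetical information gains of two filters (nats)
    print(choose_model([1.0, -1.0, 1.0, -1.0], gains))          # small residues
    print(choose_model([100.0, -100.0, 100.0, -100.0], gains))  # large residues
```

With small residues the primitive explains the patch well and the explicit code is shorter; with large residues the histogram-constrained model wins. The entropy threshold introduced later in this subsection serves as a faster stand-in for this comparison.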



























If $\mathbf{I}_{\Lambda_0}$ is represented by the sparse coding model,
the posterior probability is calculated by

$$ p(\mathbf{B} \mid \mathbf{I}_{\Lambda_0}) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big\{ -\sum_{i} \frac{\| \mathbf{I}_{\Lambda_0} - \alpha_i B_i \|^2}{2\sigma^2} \Big\}, \tag{20} $$

where $n = |\Lambda_0|$. The coding length is

$$ L_{ex}(\mathbf{I}_{\Lambda_0}) = \log \frac{1}{p(\mathbf{B} \mid \mathbf{I}_{\Lambda_0})} = \frac{n}{2} \log 2\pi\sigma^2 + \sum_{i} \frac{\| \mathbf{I}_{\Lambda_0} - \alpha_i B_i \|^2}{2\sigma^2}. $$

Since in real applications $\sigma^2$ is estimated from the given
data, $\sum_{i} \| \mathbf{I}_{\Lambda_0} - \alpha_i B_i \|^2 = n\sigma^2$
holds by definition. As a result, the coding length is derived as

$$ L_{ex}(\mathbf{I}_{\Lambda_0}) = \frac{n}{2} (\log 2\pi\sigma^2 + 1). \tag{21} $$

If $\mathbf{I}_{\Lambda_0}$ is described by the FRAME model, the
posterior probability is calculated by

$$ p(\mathbf{F} \mid \mathbf{I}_{\Lambda_0}) \propto \exp\Big\{ -\sum_{k=1}^{K} \langle \beta_k, H_k(\mathbf{I}_{\Lambda_0}) \rangle \Big\}. \tag{22} $$

The coding length is estimated through a sequential reduction process.
When $K = 0$, with no constraints, the FRAME model is a uniform
distribution, and thus the coding length is $\log |\Omega_0|$ where
$|\Omega_0|$ is the cardinality of the space of all videos in
$\Lambda_0$. Suppose the intensities of the video range from 0 to 255,
then $\log |\Omega_0| = 8 \times |\Lambda_0|$ (in bits). By adding each
constraint, the equivalence class $\Omega_K$ will shrink in size, and
the ratio of the compression $\log \frac{|\Omega_{K-1}|}{|\Omega_K|}$
is approximately equal to the information gain in (10). Therefore we
can calculate the coding length by

$$ L_{im}(\mathbf{I}_{\Lambda_0}) = \log |\Omega_0| - \log \frac{|\Omega_0|}{|\Omega_1|} - \ldots - \log \frac{|\Omega_{K-1}|}{|\Omega_K|}. \tag{23} $$

By comparing $L_{im}(\mathbf{I}_{\Lambda_0})$ and
$L_{ex}(\mathbf{I}_{\Lambda_0})$, whichever model has the shorter
coding length wins the competition and is chosen for
$\mathbf{I}_{\Lambda_0}$.

In practice, we use a faster estimation which utilizes the
relationship between the coding length and the entropy of the local
posterior probabilities.

Consider the entropy of $p(\mathbf{B} \mid \mathbf{I}_{\Lambda_0})$,

$$ \mathcal{H}(\mathbf{B} \mid \mathbf{I}_{\Lambda_0}) = -E_{p(\mathbf{B} \mid \mathbf{I}_{\Lambda_0})} [\log p(\mathbf{B} \mid \mathbf{I}_{\Lambda_0})]. \tag{24} $$

It measures the uncertainty of selecting a primitive in $\Delta_B$ for
representation. The sharper the distribution
$p(\mathbf{B} \mid \mathbf{I}_{\Lambda_0})$ is, the lower the entropy
$\mathcal{H}(\mathbf{B} \mid \mathbf{I}_{\Lambda_0})$ will be, which
gives a smaller $L_{ex}(\mathbf{I}_{\Lambda_0})$ according to (21).
Hence, $\mathcal{H}(\mathbf{B} \mid \mathbf{I}_{\Lambda_0})$ reflects
the magnitude of
$L_{diff}(\mathbf{I}_{\Lambda_0}) = L_{ex}(\mathbf{I}_{\Lambda_0}) - L_{im}(\mathbf{I}_{\Lambda_0})$.
Set an entropy threshold $\mathcal{H}_0$ on
$\mathcal{H}(\mathbf{B} \mid \mathbf{I}_{\Lambda_0})$; ideally,
$\mathcal{H}(\mathbf{B} \mid \mathbf{I}_{\Lambda_0}) = \mathcal{H}_0$
if and only if $L_{diff}(\mathbf{I}_{\Lambda_0}) = 0$. Therefore, when
$\mathcal{H}(\mathbf{B} \mid \mathbf{I}_{\Lambda_0}) < \mathcal{H}_0$,
we consider $L_{ex}(\mathbf{I}_{\Lambda_0})$ to be lower and
$\mathbf{I}_{\Lambda_0}$ is modeled by the sparse coding model;
otherwise it is modeled by the FRAME model.

It is clear that $\mathcal{H}(\mathbf{B} \mid \mathbf{I}_{\Lambda_0})$
has the same form and meaning as sketchability (Guo et al (2007)) in
appearance representation and trackability (Gong and Zhu (2012)) in
motion representation. Therefore, sketchability and trackability can
be used for model selection for each local volume. Fig. 2(b) and (c)
show the sketchability and trackability maps calculated by the local
entropy of posteriors. The two maps decide the partition of the video
into the explicit and implicit regions. Within the explicit regions,
they also decide whether a patch is trackable (using primitives with a
size of 11 × 11 pixels × 3 frames) or intrackable (using primitives
with 11 × 11 pixels × 1 frame).


3 Algorithms and experiments

3.1 Spatio-temporal filters


In the vision literature, spatio-temporal filters have been widely
used for motion information extraction (Adelson and Bergen (1985)),
optical flow estimation (Heeger (1987)), multi-scale representation of
temporal data (Lindeberg and Fagerström (1996)), pattern
categorization (Wildes and Bergen (2000)), and dynamic texture
recognition (Derpanis and Wildes (2010)). In the experiments, we
choose spatio-temporal filters $\Delta_F$ as shown in Fig. 5. It
includes three types:

1 Static filters. Laplacian of Gaussian (LoG), Gabor, gradient, or
  intensity filter on a single frame. They capture statistics of
  spatial features.

2 Motion filters. Moving LoG, Gabor or intensity filters in different
  speeds and directions over three frames. Gabor motion filters move
  perpendicularly to their orientations.

3 Flicker filters. One static filter with opposite signs at two
  frames. They contrast the static filter responses between two
  consecutive frames and detect the change of dynamics.

For implicit representation, the filters are 7 × 7 pixels in size and
have 6 scales, 12 directions and 3 speeds. Each type of filter has a
special effect in textured motion synthesis, which will be discussed
and illustrated in Section 3.3.
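The three filter types listed above can be illustrated on a tiny synthetic video. The sketch below is an assumption-laden toy: videos are lists of frames, frames are lists of rows, the static kernel is a 1 × 1 intensity filter, and the motion response simply sums the static response along a straight path of velocity (u, v) over three frames. The function names are hypothetical and no bounds checking is done, so callers must keep all sampled positions inside the frames.

```python
def static_response(frame, kernel, x, y):
    """Correlate a square, odd-sized kernel with one frame, centred at (x, y)."""
    r = len(kernel) // 2
    total = 0.0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            total += kernel[dy + r][dx + r] * frame[y + dy][x + dx]
    return total


def flicker_response(video, kernel, x, y, t):
    """Same static kernel with opposite signs on frames t and t + 1."""
    return static_response(video[t], kernel, x, y) - static_response(video[t + 1], kernel, x, y)


def motion_response(video, kernel, x, y, t, u, v):
    """Static kernel moving by (u, v) pixels per frame over frames t - 1, t, t + 1."""
    return sum(
        static_response(video[t + k], kernel, x + k * u, y + k * v)
        for k in (-1, 0, 1)
    )


def moving_dot_video():
    """Three 3 x 5 frames with a bright dot moving right by one pixel per frame."""
    video = []
    for t in range(3):
        frame = [[0.0] * 5 for _ in range(3)]
        frame[1][1 + t] = 1.0
        video.append(frame)
    return video


if __name__ == "__main__":
    video = moving_dot_video()
    intensity = [[1.0]]  # 1 x 1 intensity filter
    print(motion_response(video, intensity, 2, 1, 1, u=1, v=0))   # follows the dot
    print(motion_response(video, intensity, 2, 1, 1, u=-1, v=0))  # wrong direction
    print(flicker_response(video, intensity, 2, 1, 1))            # dot leaves the pixel
```

The motion filter tuned to the true velocity responds most strongly, while the flicker filter responds wherever the static response changes between consecutive frames, regardless of direction.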
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I F B (u,v) 
I ■ I (10,14) 
I + I (-4,2) 





( 8 , 0 ) 

( 6 , 12 ) 

( 10 ,- 2 ) 


Fig. 6 Some examples of primitives in a frame of video. Each 
group shows the original local image I, the best fitted filter 
F, the fitted primitive B G and the velocity (u, v), which 
represents the motion of B. 


3.2 Learning motion primitives and reconstructing 
explicit regions 


After computing the sketchability and trackability maps
of one frame, we extract explicit regions in the video. We
calculate the coefficients of each part with the motion
primitives from the primitive bank,
$\alpha_{ij} = \langle I_{\Lambda_i}, B_j \rangle$,
and rank all the $\alpha_{ij}$ from high to low. Each time,
we select the primitive with the highest coefficient to
represent the corresponding domain and then apply local
suppression to its neighborhood to avoid excessive
overlapping of extracted domains. The algorithm is similar
to matching pursuit (Mallat and Zhang (1993)), and the
primitives are chosen one by one.
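
The greedy selection with local suppression described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the coefficients are taken as precomputed, the suppression uses a square (Chebyshev) neighborhood, and the names `select_primitives`, `min_dist` and `max_count` are hypothetical.

```python
def select_primitives(coeffs, min_dist, max_count):
    """Greedy, matching-pursuit-like selection with local suppression.

    coeffs:   list of (alpha, x, y, primitive_id) tuples, one per candidate
              (patch position, primitive) pair; alpha is the coefficient.
    min_dist: assumed suppression radius; candidates closer than this
              (Chebyshev distance) to a selected position are skipped.
    """
    selected = []
    # Rank all coefficients from high to low.
    for alpha, x, y, pid in sorted(coeffs, key=lambda c: c[0], reverse=True):
        if len(selected) >= max_count:
            break
        # Local suppression: skip candidates that overlap a chosen domain.
        if all(max(abs(x - sx), abs(y - sy)) >= min_dist
               for _, sx, sy, _ in selected):
            selected.append((alpha, x, y, pid))
    return selected


# Hypothetical candidates: (coefficient, x, y, primitive index).
candidates = [
    (0.9, 10, 10, 3),
    (0.8, 12, 11, 5),   # overlaps the first pick, so it is suppressed
    (0.7, 40, 40, 1),
    (0.2, 25, 10, 3),
]
picked = select_primitives(candidates, min_dist=11, max_count=10)
```

Unlike full matching pursuit, this sketch does not subtract the selected primitive from a residual; it only ranks coefficients and suppresses neighbors, which is the part the text describes.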

In our work, in order to reduce the computational
complexity, the $\alpha_{ij}$ are calculated from filter responses.
The filters used here are 11 x 11 pixels and have 18
orientations and 8 scales. The fitted filter $F_j$ gives a raw
sketch of the trackable patch and extracts property
information, such as type and orientation, for generating
the primitive. If the fitted filter is a Gabor-like filter, the
primitive $B_j$ is calculated by averaging the intensities of
the patch along the orientation of $F_j$, while if the fitted
filter is a LoG-like filter, $B_j$ is calculated by averaging
the intensities circularly around its center. Then $B_j$ is
added to the primitive set B with its motion velocities
calculated from the trackability map. It is also added
into $\Delta_B$ for the dictionary buildup. The size of each
primitive is 11 x 11, the same as the size of the fitted
filter. The velocity $(u, v)$ gives the two parameters for
recording motion information. In FigJ^ we show some 
examples of different types of primitives, such as blob, 
ridge and edge. Fig. 6 shows some examples of
reconstruction by motion primitives. In each group, the
original local image, the fitted filter, the generated
primitive and the motion velocity are given. In the frame,
each patch is marked by a square with a short line
representing its motion information.
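
As an illustration of the LoG-like case, the sketch below averages a square patch over rings around its center, so that the resulting primitive is circularly symmetric. It is a minimal sketch assuming integer-rounded ring radii; the function name `radial_average` and the ring quantization are assumptions, not details given in the text.

```python
import math


def radial_average(patch):
    """Replace each pixel by the mean intensity of its ring around the center."""
    n = len(patch)
    c = (n - 1) / 2.0
    sums, counts = {}, {}
    ring = [[0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # Ring index: distance to the center, rounded to an integer.
            r = int(round(math.hypot(x - c, y - c)))
            ring[y][x] = r
            sums[r] = sums.get(r, 0.0) + patch[y][x]
            counts[r] = counts.get(r, 0) + 1
    return [[sums[ring[y][x]] / counts[ring[y][x]] for x in range(n)]
            for y in range(n)]


# An 11 x 11 intensity ramp: every ring is point-symmetric about the
# center, so each ring mean equals the center value 10.
ramp = [[x + y for x in range(11)] for y in range(11)]
primitive = radial_average(ramp)
```

For a Gabor-like filter the analogous step would average along the filter orientation instead of around the center.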


Through the matching pursuit process, the sketchable
regions are reconstructed by a set of common primitives.
Fig. 7 shows an example of sketchable region
reconstruction using a series of common primitives.
Comparing the observed frame (a) with the reconstructed
frame (b), (c) shows the error of reconstruction. A
more detailed quantitative assessment is given in
section 3.7. It is evident that a rich dictionary of video
primitives can lead to a satisfactory reconstruction of
the explicit regions of videos.


For non-sketchable but trackable regions, based on 
the trackability map, we get the motion trace of each 
local trackable patch. Because each patch cannot be 
represented by a shared primitive, we record the whole 
patch and motion information as a special primitive for 
video reconstruction. It is obvious that special prim¬ 
itives increase model complexity compared with com¬ 
mon primitives. However, as stated in section [2Tj the 
percentage of special primitives for the explicit region 
reconstruction of one video is very small (around 2-3%), 
hence it will not affect the final storage space signifi¬ 
cantly. 



Fig. 7 The reconstruction effect of sketchable regions by common primitives. (a) The observed frame. (b) The reconstructed
frame. (c) The errors of reconstruction.

Fig. 8 Synthesis for one frame of the ocean textured motion. (a) Initial uniform white noise image. (b) Synthesized frame
with only static filters. (c) Synthesized frame with only motion filters. (d) Synthesized frame with both static and motion
filters. (e) Synthesized frame with all 3 types of filters. (f) The original observed frame.


3.3 Synthesizing textured motions by ST-FRAME 

Each local volume $I_{\Lambda_o}$ of textured motion located at $\Lambda_o$
follows a Markov random field model conditioned on its
local neighborhood $I_{\partial\Lambda_o}$,

$p(I_{\Lambda_o} \mid I_{\partial\Lambda_o}; F, \beta) \propto \exp\{-\sum_k \langle \beta_k, H_k(I_{\Lambda_o}) \rangle\}$, (25)

where the Lagrange parameters $\beta_k = \{\beta_k^{(i)}\}_{i=1}^{L} \in \beta$ are the
discrete form of the potential function $\beta_k(\cdot)$ learned from
input videos by maximum likelihood,

$\hat{\beta} = \arg\max_{\beta} \log p(I_{\Lambda_o} \mid I_{\partial\Lambda_o}; \beta, F)$
$\quad = \arg\max_{\beta} \{-\log Z(\beta) - \sum_k \langle \beta_k, H_k(I_{\Lambda_o}) \rangle\}$. (26)

A closed form of $\beta$ is not available in general, so
it is solved iteratively by

$\frac{d\beta_k}{dt} = E_{p(I; \beta, F)}[H_k(I)] - H_k^{obs}$, (27)

where $H_k^{obs}$ denotes the histogram of the observed video.

In order to draw a typical sample frame from $p(I; F, \beta)$,
we use the Gibbs sampler, which simulates a Markov
chain. Starting from any random image, e.g. white
noise, it converges to a stationary process with
distribution $p(I; F, \beta)$. Therefore, the final converged
results are governed by $p(I; \beta, F)$, which characterizes the
observed dynamic texture.
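
The iterative update in (27) can be illustrated with one discretized step. The sketch assumes the standard FRAME form, in which each $\beta_k$ moves along the difference between the synthesized and the observed histogram, with the expectation under the model approximated by the histogram of the current synthesized frame. The helper names, the bin range and the step size are hypothetical.

```python
def histogram(responses, lo, hi, bins):
    """Normalized histogram of filter responses over equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for r in responses:
        # Clamp so that r == hi falls into the last bin.
        i = min(bins - 1, max(0, int((r - lo) / width)))
        counts[i] += 1
    total = float(len(responses))
    return [c / total for c in counts]


def update_beta(beta, h_syn, h_obs, step):
    """One step of d(beta)/dt = E_p[H] - H_obs, with E_p[H] replaced by h_syn."""
    return [b + step * (s - o) for b, s, o in zip(beta, h_syn, h_obs)]


# Hypothetical filter responses of an observed and a synthesized frame.
h_obs = histogram([0.1, 0.2, 0.8, 0.9], lo=0.0, hi=1.0, bins=2)
h_syn = histogram([0.1, 0.2, 0.3, 0.9], lo=0.0, hi=1.0, bins=2)
beta = update_beta([0.0, 0.0], h_syn, h_obs, step=1.0)
```

Because the density is proportional to $\exp\{-\langle \beta_k, H_k \rangle\}$, raising $\beta_k$ on a bin that is over-represented in the synthesis lowers the probability of responses falling into that bin at the next sampling round.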

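In summary, the process of textured motion synthesis
is given by the following algorithm.

Algorithm 1. Synthesis for Textured Motion by
ST-FRAME

Input video $I^{obs} = \{I_{(1)}, \ldots, I_{(m)}\}$.
Suppose the preceding frames have been synthesized; the
goal is to synthesize the next frame $I^{syn}$.
Select a group of spatio-temporal filters
$F = \{F_k, k = 1, \ldots, K\}$ from the filter bank $\Delta_F$.
Compute $H_k^{obs}$, $k = 1, \ldots, K$, of $I^{obs}$.
Initialize $\beta_k^{(i)} \leftarrow 0$, $k = 1, \ldots, K$, $i = 1, \ldots, L$.
Initialize $I^{syn}$ as a uniform white noise image.
Repeat
    Calculate $H_k^{syn}$, $k = 1, 2, \ldots, K$, from $I^{syn}$.
    Update $\beta_k$, $k = 1, \ldots, K$, and $p(I; F, \beta)$.
    Sample $I^{syn} \sim p(I; F, \beta)$ by Gibbs sampler.
Until $\frac{1}{2} \sum_i |H_k^{obs,(i)} - H_k^{syn,(i)}| < \epsilon$ for $k = 1, 2, \ldots, K$.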
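
To make the loop of Algorithm 1 concrete, the following toy sketch runs it on a tiny frame with 4 intensity levels. It is only an illustration under strong assumptions: two hand-written filters (a horizontal gradient and a flicker filter against the previous frame) stand in for the filter bank, the energy is recomputed by brute force for every candidate intensity, and all names, step sizes and thresholds are hypothetical.

```python
import math
import random

L = 4  # intensity levels of the toy frames (assumed)


def responses(img, prev):
    """Binned responses of two toy filters: horizontal gradient and flicker."""
    h, w = len(img), len(img[0])
    grad = [img[y][x + 1] - img[y][x] + L - 1
            for y in range(h) for x in range(w - 1)]
    flick = [img[y][x] - prev[y][x] + L - 1
             for y in range(h) for x in range(w)]
    return [grad, flick]


def hist(bin_indices):
    """Unnormalized histogram over the 2L - 1 possible response values."""
    counts = [0.0] * (2 * L - 1)
    for b in bin_indices:
        counts[b] += 1.0
    return counts


def energy(img, prev, beta):
    """sum_k <beta_k, H_k(img)>, the exponent of the FRAME density."""
    return sum(b * c
               for beta_k, r in zip(beta, responses(img, prev))
               for b, c in zip(beta_k, hist(r)))


def gibbs_sweep(img, prev, beta, rng):
    """Resample every pixel once from its conditional distribution."""
    for y in range(len(img)):
        for x in range(len(img[0])):
            energies = []
            for v in range(L):
                img[y][x] = v
                energies.append(energy(img, prev, beta))
            e_min = min(energies)  # shift for numerical stability
            weights = [math.exp(-(e - e_min)) for e in energies]
            img[y][x] = rng.choices(range(L), weights=weights)[0]


def synthesize(obs, prev, max_sweeps=10, step=0.5, eps=0.05, seed=0):
    rng = random.Random(seed)
    h, w = len(obs), len(obs[0])
    h_obs = [hist(r) for r in responses(obs, prev)]
    beta = [[0.0] * (2 * L - 1) for _ in h_obs]
    syn = [[rng.randrange(L) for _ in range(w)] for _ in range(h)]  # white noise
    for _ in range(max_sweeps):
        h_syn = [hist(r) for r in responses(syn, prev)]
        errors = [0.5 * sum(abs(s - o) for s, o in zip(hs, ho)) / sum(ho)
                  for hs, ho in zip(h_syn, h_obs)]
        if max(errors) < eps:
            break
        for beta_k, hs, ho in zip(beta, h_syn, h_obs):
            n = sum(ho)
            for i in range(len(beta_k)):
                beta_k[i] += step * (hs[i] - ho[i]) / n
        gibbs_sweep(syn, prev, beta, rng)
    return syn


prev = [[(x + y) % L for x in range(6)] for y in range(6)]
obs = [[(x + y + 1) % L for x in range(6)] for y in range(6)]
syn = synthesize(obs, prev)
```

A practical implementation would update only the filter responses touched by the changed pixel instead of recomputing the whole energy, which is the saving discussed in the complexity analysis of section 3.8.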


Fig. 8 shows an example of the synthesis process. (f)
is one frame from the textured motion of ocean. Starting
from a white noise frame in (a), (b) is synthesized with
only 7 static filters. It shows high smoothness in the
spatial domain, but lacks temporal continuity with
previous frames. In contrast, the synthesis in (c) with only
9 motion filters has a macroscopic distribution similar to
the observed frame, but appears quite grainy in its local
spatial relationships. By using both static and motion
filters, the synthesis in (d) performs well on both
spatial and temporal relationships. Compared with (d),
the synthesis with 2 extra flicker filters in (e) is
smoother and more similar to the observed frame.

In Fig. 9 we show four groups of textured motion
(4 bits) synthesis by Algorithm 1: ocean (a), water
wave (b), fire (c) and forest (d). In each group, as time
passes, the synthesized frames become more and
more different from the observed ones. This is caused by
the stochasticity of textured motions. Although the
synthesized and observed videos are quite different at the
pixel level, the two sequences are perceived as nearly
identical by human observers after matching the histograms
of a small set of filter responses. This conclusion is further
supported by the perceptual studies in section 3.9. Fig. 10
shows that as the synthesized frame changes from white
noise (Fig. 8(a)) to the final synthesized result (Fig. 8(e)),
the histograms of filter responses become matched with the
observed ones.

Table 2 shows the comparison of compression ratios
between ST-FRAME and the dynamic texture model
(Doretto et al (2003)). ST-FRAME has a significantly better
compression ratio than the dynamic texture model, because
the dynamic texture model has to record PCA components
as large as the image size.

Fig. 9 Textured motion synthesis examples. For each group,
the top row shows the original videos and the bottom row
shows the synthesized ones. (a) Ocean, (b) Water wave,
(c) Fire, (d) Forest.

Table 2 The number of parameters recorded and the
compression ratios for synthesis of 5-frame textured motion
videos by ST-FRAME and the dynamic texture model
(Doretto et al (2003)).

Example      Size             ST-FRAME       Dynamic texture
Ocean        112 x 112 x 5    558 (0.89%)    25,096 (40.01%)
Water wave   105 x 105 x 5    465 (0.84%)    22,058 (40.01%)
Fire         110 x 110 x 5    527 (0.87%)    24,210 (40.02%)
Forest       110 x 110 x 5    465 (0.77%)    24,210 (40.02%)
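
The percentages in Table 2 are consistent with dividing the number of recorded parameters by the number of pixels in the 5-frame clip. The short check below reproduces them from the table entries; the helper name `ratio_percent` is an illustrative assumption.

```python
def ratio_percent(n_params, size):
    """Recorded parameters as a percentage of the clip's pixel count."""
    w, h, t = size
    return round(100.0 * n_params / (w * h * t), 2)


# (size, ST-FRAME parameters, dynamic texture parameters) from Table 2.
table2 = {
    "Ocean": ((112, 112, 5), 558, 25096),
    "Water wave": ((105, 105, 5), 465, 22058),
    "Fire": ((110, 110, 5), 527, 24210),
    "Forest": ((110, 110, 5), 465, 24210),
}
ratios = {name: (ratio_percent(st, size), ratio_percent(dt, size))
          for name, (size, st, dt) in table2.items()}
```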


3.4 Computing velocity statistics 

One popular method for velocity estimation is optical
flow. Based on the optical flow, HOOF features extract
the motion statistics by calculating the distribution of
velocities in each region. Optical flow is an effective
method for estimating the motions in trackable areas,
but does not work for the intrackable dynamic texture
areas. The three basic assumptions of the optical flow
equations, i.e. brightness constancy between matched
pixels in consecutive frames, smoothness among adjacent
pixels and slow motion, are violated in these areas
due to the stochastic nature of dynamic textures.
Therefore, we adopt a different velocity estimation method.

Fig. 10 Matching of histograms of spatio-temporal filter
responses for Ocean. The filters are (a) Static LoG (5 x 5),
(b) Static Gradient (vertical), (c) Motion Gabor (6, 150°),
(d) Motion Gabor (2, 60°), (e) Motion Gabor (2, 0°), (f) Flicker
LoG (5 x 5). (Legend: Observed, Before Synthesized, After
Synthesized.)

Considering one pixel $I(x, y, t)$ at $(x, y)$ in frame
$t$, we denote its neighborhood as $I_{\partial\Lambda_{x,y,t}}$. Comparing the
patch $I_{\partial\Lambda_{x,y,t}}$ with all the patches in the previous frame
within a searching radius, each patch corresponding to
one velocity $v = (v_x, v_y)$, we obtain a distribution

$p(v) \propto \exp\{-\|I_{\partial\Lambda_{x,y,t}} - I_{\partial\Lambda_{x - v_x, y - v_y, t-1}}\|^2\}$. (28)

This distribution describes the probability of the origin
of the patch, i.e. the location from which the patch
$I_{\partial\Lambda_{x,y,t}}$ moves. Equivalently, it reflects the average
probability of the motions of the pixels in the patch.
Therefore, by clustering all the pixels according to their
velocity distributions, the center of each cluster
approximately gives the velocity statistics of all the pixels
in this cluster, which reflects the motion pattern of these
clustered pixels. Fig. 14 and Fig. 15 show some examples
of velocity statistics, in which brighter means higher
probability and darker means lower probability. The
meanings of these two figures are explained later.
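
The velocity distribution in (28) can be sketched for a single pixel as follows. This is a minimal illustration that assumes intensities scaled to [0, 1], a square patch of half-width `half`, and a previous-frame position of $(x - v_x, y - v_y)$ for velocity $(v_x, v_y)$; the function and parameter names are hypothetical, and the caller is assumed to keep all patches inside the image.

```python
import math


def velocity_distribution(cur, prev, x, y, half, radius):
    """p(v) proportional to exp(-||patch_t(x, y) - patch_{t-1}(x - vx, y - vy)||^2)."""
    def patch(img, cx, cy):
        return [img[cy + dy][cx + dx]
                for dy in range(-half, half + 1)
                for dx in range(-half, half + 1)]

    ref = patch(cur, x, y)
    dist = {}
    for vy in range(-radius, radius + 1):
        for vx in range(-radius, radius + 1):
            cand = patch(prev, x - vx, y - vy)
            dist[(vx, vy)] = sum((a - b) ** 2 for a, b in zip(ref, cand))
    d_min = min(dist.values())  # shift for numerical stability only
    weights = {v: math.exp(-(d - d_min)) for v, d in dist.items()}
    z = sum(weights.values())
    return {v: w / z for v, w in weights.items()}


# Toy frames: the current frame is the previous one moved right by 1 pixel.
prev = [[((3 * x + 5 * y) % 7) / 7.0 for x in range(12)] for y in range(12)]
cur = [[prev[y][x - 1] if x > 0 else prev[y][0] for x in range(12)]
       for y in range(12)]
p = velocity_distribution(cur, prev, x=6, y=6, half=2, radius=1)
best = max(p, key=p.get)
```

For stochastic textures the distribution typically stays spread over several velocities rather than collapsing onto one, which is the property the clustering step relies on.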

Compared to HOOF, the estimated velocity distribution
is more suitable for modeling textured motion. Firstly,
the velocity distribution is estimated pixel-wise. Hence it
can depict more non-smooth motions. Secondly, although it
seeks to compare the intensity pattern around a point to
nearby regions at a subsequent temporal instance, which
seems to also take the brightness constancy assumption
into account, the difference here is that it calculates the
probability of motions rather than a single pixel
correspondence. As a result, the constraint imposed by the
assumption is weakened, and the method has the ability to
represent stochastic dynamics.


3.5 Synthesizing textured motions by MA-FRAME 

In the MA-FRAME model, similar to ST-FRAME, each
local volume $I_{\Lambda_o}$ of textured motion follows a Markov
random field model. The difference is that MA-FRAME
extracts motion information via the distribution of
velocities $v$.

In the sampling process, $I_{\Lambda_{im}}$ and $V$ are sampled
simultaneously following the joint distribution

$p(I, V; F, \beta) \propto \exp\{-\langle \beta, \ldots \rangle\}$.

In experiments, we design an effective way for sampling
from the above model. For each pixel, we build a
2D distribution matrix, whose two dimensions are
velocities and intensities respectively, to guide the
sampling process. The sampling probability for every
candidate (labeled by one velocity and one intensity) is
obtained by integrating the motion score and the
appearance score and multiplying by the smoothness weight,

SCORE{v,I) 

oc exp{-w(i;)((/?(*), \\H^*'> - 

The details are explained with the illustration in Fig. 11
for the sampling method at one pixel. For each pixel
$(x, y)$ of the current frame $I_t$, we consider every possible
velocity $v_{ij} = (v_{x,i}, v_{y,j})$ within the range
$-v_{max} \le v_{x,i}, v_{y,j} \le v_{max}$. Each velocity corresponds to a
position $(x - v_{x,i}, y - v_{y,j})$ in the previous frame $I_{t-1}$. Under
velocity $v_{ij}$, the perturbation range of
$I_{t-1}(x - v_{x,i}, y - v_{y,j})$ yields the intensity candidates for
$I_t(x, y)$, which is a smaller interval than [0, 255] and thus
reduces the computational cost. In the example shown
(Fig. 11(a)), $v_{max} = 1$ and the perturbed intensity range is
$[I_{t-1}(x - v_{x,i}, y - v_{y,j}) - m, \; I_{t-1}(x - v_{x,i}, y - v_{y,j}) + m]$.
Therefore we have 9 velocity candidates and $2m + 1$ intensity
candidates for each velocity, hence the size of the
sampling matrix is $9 \times (2m + 1)$ (Fig. 11(b)). With the
motion constraints given by matching the velocity
statistics, the velocity candidates have their motion scores.
With the appearance constraints given by matching
the filter response histograms, the intensity candidates have
their appearance scores. By integrating the two sets of
scores, we obtain a preliminary sampling matrix, shown
in Fig. 11(b).

Fig. 11 Sampling process of MA-FRAME. (a) For each pixel
of the current frame $I_t(x, y)$, the sample candidates are
perturbed intensities of its neighborhood pixels in the
previous frame $I_{t-1}$, determined by different velocities.
(b) The velocity list and intensity perturbations construct
the two dimensions of the 2D distribution matrix, which is
used for sampling $I_t(x, y)$. Here, $I(v_{ij})$ is short for
$I_{t-1}(x - v_{x,i}, y - v_{y,j})$, and $i, j$ are indexes for different
velocities in the neighborhood area.
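
The construction of the sampling matrix can be sketched as follows. Since the exact SCORE formula is not legible in the source, the sketch assumes one plausible form: each cell is proportional to the smoothness weight of its velocity times the exponential of the negative sum of motion and appearance scores. The function names and the score layout are hypothetical.

```python
import math
import random


def sampling_matrix(motion_score, appearance_score, weights):
    """Rows: velocity candidates; columns: intensity perturbations in [-m, m].

    Assumed form (not confirmed by the source):
    P[i][j] proportional to weights[i] * exp(-(motion_score[i] + appearance_score[i][j])).
    """
    raw = [[weights[i] * math.exp(-(motion_score[i] + a)) for a in row]
           for i, row in enumerate(appearance_score)]
    z = sum(sum(row) for row in raw)
    return [[p / z for p in row] for row in raw]


def sample_candidate(matrix, rng):
    """Draw one (velocity index, perturbation index) cell from the matrix."""
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    probs = [matrix[i][j] for i, j in cells]
    return rng.choices(cells, weights=probs)[0]


# v_max = 1 gives 9 velocity candidates; m = 2 gives 2m + 1 = 5 intensities.
n_v, m = 9, 2
matrix = sampling_matrix([0.0] * n_v,
                         [[0.0] * (2 * m + 1) for _ in range(n_v)],
                         [1.0] * n_v)
i, j = sample_candidate(matrix, random.Random(0))
```

With all scores equal and unit weights, every one of the 45 cells has probability 1/45; informative scores and weights then sharpen the matrix toward candidates that match both the velocity statistics and the filter response histograms.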

In order to guarantee the motion of each pixel is as 
consistent as possible with its neighborhoods to make 
the macroscopic motion smooth enough, we add a set 
of weights on the distribution matrix, in which each 
multiplier for candidates of one velocity is calculated 
by 

1 ^y,j ) 

{Xk:yk)^dA(^x^y^ 

The weights favor velocity candidates that are closer
to the velocities of the neighboring pixels. With the
weights, the sampled velocities can be regarded
as a "blurred" optical flow. The main difference is that
it preserves the uncertainty of the dynamics in a textured
motion rather than assigning a definite velocity to every
local pixel.

After multiplying the weights into the preliminary
matrix, we get the final sampling matrix. Although the
main purpose of MA-FRAME is sampling the intensities
of each pixel from a textured motion, the sampling of
intensities is highly related to velocities, and the
sampling process is actually based on the joint distribution
of velocity and intensity.

Fig. 12 Texture synthesis for 18 frames of ocean video (from top-left to bottom-right) by MA-FRAME.

Fig. 13 Texture synthesis for 18 frames of bushes video (from top-left to bottom-right) by MA-FRAME.

In summary, textured motion synthesis by MA-FRAME
is given as follows.


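Algorithm 2. Synthesis for Textured Motion by
MA-FRAME

Input video $I^{obs} = \{I_{(1)}, \ldots, I_{(m)}\}$.
Suppose the preceding frames have been synthesized; the
goal is to synthesize the next frame $I^{syn}$.
Select a group of static and flicker filters
$F = \{F_k, k = 1, \ldots, K_s\}$ from the filter bank $\Delta_F$, where
$K_s$ is the number of selected filters.
Compute $\{H_k^{(s)}, k = 1, \ldots, K_s\}$ and $\{H_k^{(v)}, k = 1, \ldots, K_v\}$ of
$I^{obs}$, where $K_v$ is the number of velocity clusters.
Initialize $\beta_k^{(i)} \leftarrow 0$, $k = 1, \ldots, K$, $i = 1, \ldots, L$.
Initialize the velocity vector $V = (v(x, y))_{(x, y) \in \Lambda}$
uniformly, and initialize $I^{syn}$ by choosing intensities
based on $V$.
Repeat
    Calculate $\{H_k^{(s),syn}, k = 1, 2, \ldots, K_s\}$ and
    $\{H_k^{(v),syn}, k = 1, 2, \ldots, K_v\}$ from $I^{syn}$.
    Update $\beta_k$, $k = 1, \ldots, K$, and $p(I; F, \beta)$.
    Sample $(I^{syn}, V) \sim p(I, V; F, \beta)$ by Gibbs sampler.
Until the differences between the observed and synthesized
statistics fall below $\epsilon$ for $k = 1, 2, \ldots, K_s + K_v$.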


Fig. 12 and Fig. 13 show two examples of textured
motion synthesis by MA-FRAME. Different from the
synthesis results by ST-FRAME, MA-FRAME can deal with
videos of larger size, higher intensity level (8 bits here
compared to 4 bits in the ST-FRAME experiments) and more
frames because of its smaller sample space and higher
temporal continuity. Furthermore, it generates better
motion pattern representations.

Fig. 14 shows the comparison of velocity statistics
between the original video and the synthesized video
for different textured motion clusters; brighter means
higher motion probability and darker means lower
probability. The statistics are quite consistent, which
means that the original and synthesized videos
have similar macroscopic motion properties.

We also test local motion consistency between observed
and synthesized videos by comparing the velocity
distributions of every pair of corresponding pixels. Fig. 15
shows the comparisons of ten pairs of randomly chosen
pixels. Most of them match well. This demonstrates that
the motion distributions of most local patches are also
well preserved during the synthesis procedure.

Fig. 14 Five pairs of global statistics of velocities for
comparison. Each patch corresponds to the neighborhood
lattices as shown in Fig. 11(a). Brighter means higher motion
probability while darker means lower probability.

Fig. 15 Ten pairs of local statistics of velocities for
comparison. Upper row: original; lower row: synthesis. Each
patch corresponds to the neighborhood lattices as shown in
Fig. 11(a). Brighter means higher motion probability while
darker means lower probability.


3.6 Dealing with occlusion parts in texture synthesis 

Before providing the full version of the computational
algorithm for VPS, we first introduce how to deal with
occluded areas.

In video, dynamic background textures are often occluded
by the movement of foreground objects. Synthesizing
background texture by ST-FRAME uses histograms of
spatio-temporal filter responses. When a textured region
becomes occluded, the pattern no longer belongs to the
same equivalence class. In this event, the spatio-temporal
responses are not precise enough for matching the given
histograms, and may cause a deviation in the synthesis
results. These errors may accumulate over frames and the
synthesis will ultimately degenerate completely. Synthesis
by MA-FRAME has a greater problem because the intensities
in the current frame are selected from small perturbations
of intensities in the previous frame. If no pixel belonging
to the same texture class can be found in the neighborhood
in the previous frame, the intensity the pixel adopts
may be incompatible with the other pixels around it.

In order to solve this problem, occluded pixels are
sampled separately by the original (spatial) FRAME
model, which means that we have two classes of filter
response histograms:

1 Static filter response histograms. Histograms
are calculated by summarizing static filter responses
of all the textured pixels;

2 Spatio-temporal filter response histograms.
Histograms are calculated by summarizing spatio-temporal
filter responses of all the non-occluded textured pixels.

Therefore, in the sampling process, the occluded
pixels and non-occluded pixels are treated differently.
First, their statistics are constrained by different sets of
filters; second, in MA-FRAME, the intensities of
non-occluded pixels are sampled from the intensity
perturbations of their neighborhood locations in the previous
frame, while the intensities of occluded pixels are sampled
from the whole intensity space, say 0-255 for 8-bit grey
levels.
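
The different treatment of occluded and non-occluded pixels in MA-FRAME sampling amounts to choosing different intensity candidate sets. The sketch below illustrates this; clipping the perturbation range to valid grey levels and the names used are assumptions made for the example.

```python
def intensity_candidates(prev_value, occluded, m, levels=256):
    """Candidate intensities for one pixel of the current frame.

    Non-occluded pixels: perturbations [prev_value - m, prev_value + m]
    of the matched previous-frame intensity, clipped to the valid range.
    Occluded pixels: the whole intensity space.
    """
    if occluded:
        return list(range(levels))
    lo = max(0, prev_value - m)
    hi = min(levels - 1, prev_value + m)
    return list(range(lo, hi + 1))


visible = intensity_candidates(100, occluded=False, m=3)
hidden = intensity_candidates(100, occluded=True, m=3)
near_black = intensity_candidates(1, occluded=False, m=3)
```

The occluded case is much more expensive per pixel (256 candidates instead of 2m + 1), which is acceptable as long as occluded pixels are a small fraction of the textured region.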


3.7 Synthesizing videos with VPS 

In summary, the full version of the computational algo¬ 
rithm for video synthesis of VPS is presented as follows. 

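Algorithm 3. Video Synthesis via Video Primal
Sketch

Input a video $I^{obs} = \{I_{(1)}, \ldots, I_{(m)}\}$.
Compute sketchability and trackability for separating
$I^{obs}$ into the explicit region $I_{\Lambda_{ex}}$ and the implicit
region $I_{\Lambda_{im}}$.
for t = 1:m
    Reconstruct $I^{obs}_{(t),\Lambda_{ex}}$ by the sparse coding model
    with the selected primitives B chosen from the
    dictionary $\Delta_B$ to get $I^{syn}_{(t),\Lambda_{ex}}$.
    For each region of homogeneous textured motion
    $\Lambda_{im,j}$, using $I^{syn}_{(t),\Lambda_{ex}}$ as boundary condition,
    synthesize $I^{obs}_{(t),\Lambda_{im,j}}$ by the ST-FRAME model or
    MA-FRAME with the selected filter set F chosen
    from the filter bank $\Delta_F$ to get $I^{syn}_{(t),\Lambda_{im,j}}$;
    $I^{syn}_{(t),\Lambda_{im}} = \bigcup_j I^{syn}_{(t),\Lambda_{im,j}}$.
    The synthesis of the t-th frame of the video is
    given by aligning $I^{syn}_{(t),\Lambda_{ex}}$ and $I^{syn}_{(t),\Lambda_{im}}$ together
    seamlessly.
end for
Output the synthesized video $I^{syn}$.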
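
The per-frame structure of Algorithm 3 can be sketched as a loop that reconstructs the explicit region, synthesizes the implicit region, and merges the two with a region mask. The two callables stand in for the sparse coding reconstruction and the ST-FRAME or MA-FRAME synthesis; they, and all names below, are hypothetical placeholders rather than the authors' interfaces.

```python
def compose_frame(explicit, implicit, explicit_mask):
    """Take explicit-region pixels where the mask is True, implicit ones elsewhere."""
    return [[e if m else i for e, i, m in zip(e_row, i_row, m_row)]
            for e_row, i_row, m_row in zip(explicit, implicit, explicit_mask)]


def synthesize_video(frames, masks, reconstruct_explicit, synthesize_implicit):
    """Per-frame loop: explicit reconstruction, implicit synthesis, merging."""
    synthesized = []
    for frame, mask in zip(frames, masks):
        explicit = reconstruct_explicit(frame, mask)
        previous = synthesized[-1] if synthesized else None
        # The explicit result is passed on so it can act as boundary condition.
        implicit = synthesize_implicit(frame, mask, explicit, previous)
        synthesized.append(compose_frame(explicit, implicit, mask))
    return synthesized


# Trivial stand-ins: copy explicit pixels, fill implicit pixels with zeros.
frames = [[[5, 6], [7, 8]]]
masks = [[[True, False], [False, True]]]
video = synthesize_video(
    frames, masks,
    reconstruct_explicit=lambda frame, mask: frame,
    synthesize_implicit=lambda frame, mask, explicit, previous:
        [[0] * len(row) for row in frame],
)
```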


Figi shows this process as we introduced in sec¬ 
tion B FigiD shows three examples of video synthesis 
(YCbCr color space, 8 bits for grey level) by VPS frame
by frame. In every experiment, observed frames,
trackability maps, and final synthesized frames are shown.
In Table 3, H.264 is selected as the reference for the
compression ratio compared with VPS, from which we can
tell that VPS is competitive with state-of-the-art video
encoders on video compression.

For assessing the quality of the synthesized results
quantitatively, we adopt two criteria for the different
representations, rather than the traditional approach based
on error sensitivity, as it has a number of limitations
(Wang et al (2004)). The error for explicit representations
is measured by the difference of pixel intensities

1 ^ 


■ \\py^ -1 


obs I 


(29) 


while for implicit representations, the error is given by 
the difference of filter response histograms, 

err^”^= \ Y 

Uim ^ 

(30) 

Table 4 shows the quality assessments of the synthesis,
which demonstrates good performance of VPS on
synthesizing videos.


Table 3 Compression ratio of video synthesis by VPS and
H.264 to raw image sequence.

Example   Raw (Kb)   VPS (Kb)         H.264 (Kb)
1         924        16.02 (1.73%)    20.8 (2.2%)
2         1,485      26.4 (1.78%)     24 (1.62%)
3         1,485      28.49 (1.92%)    18 (1.21%)


Fig. 16 Video synthesis. For each experiment, Row 1: original
frames; Row 2: trackability maps; Row 3: synthesized
frames. [Panel labels: t = 1, t = 3, t = 5.]


Table 4 Error assessment of synthesized videos.

Example   Size (Pixels)    Error ($I_{\Lambda_{ex}}$)   Error ($I_{\Lambda_{im}}$)
1         190 x 330 x 7    5.37%                        0.59%
2         288 x 352 x 7    3.07%                        0.16%
3         288 x 352 x 7    2.8%                         0.17%


3.8 Computational Complexity Analysis

In this subsection, we analyze the computational
complexity of the algorithms studied in this paper. We
discuss the complexity of four algorithms in the following.
The implementation environment is a desktop computer
with an Intel Core i7 2.9 GHz CPU, 16 GB memory
and the Windows 7 operating system.

1) Video modeling by VPS. Suppose one frame
of a video contains $N$ pixels, of which $N_{ex}$ pixels belong
to explicit regions and $N_{im}$ to implicit regions.
Let the size of the filter dictionary be $N_F$ and the filter
size be $S_F$; the computational complexity for calculating
filter responses is $O(N N_F S_F)$. For extracting and
learning explicit bricks, the complexity is no more than
$O(N_{ex} \sqrt{S_F})$. For calculating the response histograms
of $K$ chosen filters within the implicit regions, the
complexity is no more than $O(N_{im} K k)$ if there are $k$
homogeneous textured areas in the regions. To sum up,
the total computational complexity for video coding is
no more than $O(N N_F S_F + N_{ex} \sqrt{S_F} + N_{im} K k)$. In our
experiments, coding one frame of a video of size
288 x 352 takes less than 0.5 seconds.

2) Reconstruction of explicit regions. Because
the information of all the bases for explicit regions is
recorded and no additional computation is needed
for reconstruction, the computational complexity can
be regarded as $O(1)$, and the reconstruction costs
negligible time in comparison to the other components.

3) Synthesis of implicit regions by Gibbs sampling
by ST-FRAME. For one round of sampling, each
of the $N_{im}$ pixels is sampled over the range of all
intensity levels, say $L$. For every sampling candidate,
i.e. one intensity, the score is calculated via the
change of the synthesized filter response histograms. To
reduce the computational burden, we can simply update
the change of filter responses caused by the change of
the intensity at the current pixel. This operation requires
$K S_F$ multiplications. As a result, the
computational complexity for one round of sampling of
one frame is $O(N_{im} L K S_F)$. In the experiments of this
paper, one frame is sampled for about 20 rounds.
The running time is then about 2 minutes if the image
is 4 bits and the size of the implicit region is 10,000 pixels.

4) Synthesis of implicit regions by Gibbs sampling
by MA-FRAME. The computational complexity
of MA-FRAME is quite similar to that of ST-FRAME.
The biggest difference is the number of sampling
candidates. As the number of velocity candidates is $N_v$ and
the intensity perturbation range is $[-m, m]$, the
computational complexity is $O(N_{im} N_v m K S_F)$, which is on
the same level as ST-FRAME. However, in real
applications, because the intensities in the neighborhood
of one pixel are not far apart, the intensities of the
candidates with different velocities are quite redundant. As
a result, MA-FRAME may save a lot of time compared
with ST-FRAME, especially when the intensity level
is high. For one frame with 8 bits and 60,000 pixels,
the running time is about 4 minutes for 20 rounds of
sampling.
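
The two per-round costs can be compared numerically. The sketch below simply evaluates the stated expressions; the filter count K = 18 (7 static, 9 motion and 2 flicker filters, as in the Fig. 8 example), the filter size S_F = 7 x 7 = 49, and the perturbation half-width m = 5 are assumptions chosen for illustration, not settings reported for the timing experiments.

```python
def st_frame_ops(n_im, levels, k, s_f):
    """Multiplications per Gibbs round for ST-FRAME: N_im * L * K * S_F."""
    return n_im * levels * k * s_f


def ma_frame_ops(n_im, n_v, m, k, s_f):
    """Multiplications per Gibbs round for MA-FRAME: N_im * N_v * m * K * S_F."""
    return n_im * n_v * m * k * s_f


K, S_F = 18, 49  # assumed filter count and filter size (7 x 7)

# 4-bit frame (L = 16) with a 10,000-pixel implicit region.
st_4bit = st_frame_ops(10_000, 16, K, S_F)

# 8-bit frame (L = 256) with 60,000 pixels, 9 velocity candidates, m = 5.
st_8bit = st_frame_ops(60_000, 256, K, S_F)
ma_8bit = ma_frame_ops(60_000, 9, 5, K, S_F)
speedup = st_8bit / ma_8bit
```

Under these assumed settings the candidate set shrinks from 256 intensities to 9 x 5 = 45 velocity-intensity pairs, which is where the advantage of MA-FRAME at high intensity levels comes from.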

In summary, the computational complexity of video
modeling / coding by VPS is small, but that of video
synthesis is quite large. This is because of the texture
synthesis procedure. In VPS, the textures are modeled by
MRF and synthesized via a Gibbs sampling process, which
is well known to be computationally expensive. However,
video synthesis is only one of the applications
of VPS and is used for verifying the correctness of the
model. As a result, its cost is not the main concern
here.


3.9 Perceptual Study 


The error assessment of VPS is consistent with human 
perception. To support this claim, in this subsection, we 
present a series of human perception experiments and 
explore the relationship between perception accuracy. 
In the experiments below, the 30 participants include 
graduate students and researchers from mathematics, 
computer science and medical science. The age range is 
from 22 to 39, and they all have normal or corrected- 
to-normal vision. 

In the first experiment, we randomly crop several 
clips of videos with different sizes from the four synthe¬ 
sized textured motion examples and their correspond¬ 
ing original videos (as shown in the left side of Eig. 17, 
[T8| and each video is shown one frame as an exam¬ 

ple which is marked by (a), (b), (c) and (d) respectively, 
and they are in different sizes but shown in the same size 
after zooming for better shows). And then for original 
and synthesized examples respectively, each participant 
is shown 40 clips one by one (10 clips from each tex¬ 
ture) and is required to guess which texture they come 
from. We show 3 representative groups of results below 
for demonstration, in which the sizes of cropped exam¬ 
ples are 5 x 5, 10 x 10 and 20 x 20 respectively. Both 
of the confusion rates (%) of original and synthesized 
examples are shown in the tables on the right side in 
Eig. \T8\ and Each row gives the average confu¬ 
sion rates, which the video clip labeled by the row title 
is judged coming from textures labeled by the column 
titles. In order to test if the syntheses are perceived the 
same with the original videos, we compare the original 
and synthesis confusion tables in each group. Erom the 
results, we can tell that the confusion tables are mostly 
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Original Synthesis 



(a) 

(b) 

(c) 

(d) 


(a) 

(b) 

(c) 

(d) 

^ (a) 

44 

37 

6.7 

12.3 

(a) 

42 

36 

7 

15 

^ (b) 

32 

45.3 

8 

14.7 

(b) 

32.7 

43.3 

8.7 

15.3 

^ (c) 

9.7 

7 

44 

39.3 

(c) 

10 

7.7 

44.3 

38 

^ (d) 

10.3 

12.3 

35 

42.3 

(d) 

10.7 

13.7 

35.3 

40.3 

Fig. IT 

Perceptual confusion test on original and synthe- 


sized textured motions respectively. The size of cropped ex¬ 
amples are 5x5. 



Original 



(a) 

(b) 

(c) 

(d) 

(a) 

54.3 

36.3 

2 

7.3 

(b) 

32 

56 

2.7 

9.3 

(c) 

3.7 

1.7 

59 

35.7 

(d) 

2.7 

13.7 

34 

49.7 


Synthesis 



(a) 

(b) 

(c) 

(d) 

(a) 

53 

36 

3 

8 

(b) 

32 

54.3 

3.7 

10 

(c) 

3.3 

3 

59.7 

34 

(d) 

3 

15.7 

33.7 

47.7 


Fig. 18 Perceptual confusion test on original and synthe¬ 
sized textured motions respectively. The size of cropped ex¬ 
amples are 10 X 10. 


Table 5 The ANOVA results of analyzing recognition accu¬ 
racies of original and synthesized textures. For each texture 
in every group, the corresponding F and p values are shown 
respectively. 


F/p 

Group 1 

Group 2 

Group 3 

(a) 

1.34/0.2520 

0.65/0.4222 

0.02/0.8813 

(b) 

0.96/0.3305 

0.70/0.4065 

0.20/0.6583 

(c) 

0.06/0.8100 

0.15/0.6993 

0.03/0.8563 

(d) 

1.43/0.2366 

1.08/0.3088 

0.26/0.6151 


when the size goes larger than 25 x 25 in this experi¬ 
ment, the accuracies get very close to 100%. This exper¬ 
iment demonstrates the fact that the dynamic textures 
synthesized by the statistics of dynamic filters can be 
well discriminated by human vision, although the syn¬ 
thesized one and the original one are totally different 
on pixel level. Therefore it is evident that the approx¬ 
imation of filter response histograms reflects the qual¬ 
ity of video synthesis. Furthermore, it is proved that 
larger area textures give much better perception effect 
because human can extract more macroscopic statis¬ 
tical information and motion-appearance characteris¬ 
tics, while small size local areas can only provide salient 
structural information which may be shared by a vari¬ 
ous of different videos. 



(a) 

Original 

(b) 

(c) 

(d) 


(a) 

Synthesis 

(b) 

(c) 

(d) 

(a) 

82 

15 

0.7 

2.3 

(a) 

81.7 

14 

1 

3.3 

(b) 

13.3 

84.3 

0.3 

2 

(b) 

14 

83.3 

0.7 

2 

(c) 

2.3 

0.7 

77.3 

19.7 

(c) 

2 

0.7 

77.7 

19.7 

(d) 

1.3 

3.7 

16.7 

78.3 

(d) 

1.7 

4.3 

16.7 

77.3 


Table 6 The accuracy of differentiating the original video 
from the synthesized one in different scales. As the percentage 
is getting closer to 50%, it means it is harder to discriminate 
the original and synthesized videos for observers. 


Video 

Scale 100% 

Scale 75% 

Scale 50% 

Scale 25% 

1 

66.7 

56.7 

46.7 

50 

2 

100 

90 

73.3 

63.3 

3 

73.3 

63.3 

50 

53.3 


Fig. 19 Perceptual confusion test on original and synthe¬ 
sized textured motions respectively. The size of cropped ex¬ 
amples are 20 X 20. 

consistent. For more precise quantitative estimation, we 
also analyze the recognition accuracies by ANOVA in 
Table in which, each row shows the corresponding F 
and p values for each texture in all the three groups. The 
results show that the recognition accuracies on original 
and synthesized textures do not differ signiflcantly. 

Also, it is noted that texture (a) and (b) appear 
similarly while (c) and (d) tend to be confused with 
each other. Therefore, the confusion rates between (a) 
and (b), (c) and (d) are apparently larger. However, 
from Fig. [^to[^ as the size of cropped videos gets 
larger, the confusion rate becomes lower, and actually 


In the second experiment, we test whether the video
synthesized by VPS gives a visual impression similar to
that of the original video. Each time, we show the original
and the synthesized videos to one participant at
the same scale. The videos are played synchronously,
and the participant is required to point out which is
the original video within 5 seconds. Each pair of videos is
tested at four scales: 100%, 75%, 50% and 25%. The
accuracies are shown in Table 6. The results show that when
the videos are shown at larger scales, it is easier to
discriminate the original from the synthesized video,
because many structural details can be noticed by the
observers. As the scale gets smaller, macroscopic
information dominates the visual impression, so the
original and synthesized videos are perceived as almost
the same and the accuracy drops toward 50%. This
experiment indicates that although VPS cannot give a
complete pixel-level reconstruction of a video, especially
for dynamic textures, the synthesis gives human observers a
similar visual impression, which means that most of the key
information for perception is kept by the VPS model.
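
The trend described above can be checked directly against the numbers in Table 6. The sketch below averages, over the three videos, the absolute gap between the reported accuracy and the 50% chance level at each scale. This summary measure and the name `mean_gap_from_chance` are our own illustrative choices and are not part of the original analysis.

```python
# Accuracies (%) copied from Table 6: video -> {scale in % -> accuracy}.
table6 = {
    1: {100: 66.7, 75: 56.7, 50: 46.7, 25: 50.0},
    2: {100: 100.0, 75: 90.0, 50: 73.3, 25: 63.3},
    3: {100: 73.3, 75: 63.3, 50: 50.0, 25: 53.3},
}
CHANCE = 50.0  # accuracy of a pure guess between two videos


def mean_gap_from_chance(table, scale):
    """Average |accuracy - 50| over all videos at one display scale."""
    gaps = [abs(row[scale] - CHANCE) for row in table.values()]
    return sum(gaps) / len(gaps)


gaps_by_scale = {s: mean_gap_from_chance(table6, s) for s in (100, 75, 50, 25)}
for s, gap in gaps_by_scale.items():
    print(f"scale {s}%: mean |accuracy - 50| = {gap:.1f}")
```

Under this measure the gap shrinks monotonically as the scale decreases, in line with the observation that the two videos become harder to tell apart at smaller scales.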


3.10 VPS adapting over scales, densities and dynamics 


As observed in (Gong and Zhu (2012)), the
optimal visual representation of a region is affected by
distance, density and dynamics. In Fig. 20 we show four
video clips from a long video sequence. As the scale
changes from high to low over time, the birds in the
videos are perceived as boundary lines, groups of
kernels, dense points and dynamic textures respectively.
We show the VPS of each clip and demonstrate that the
proper representations are chosen by the model. Fig. 21
shows the types of primitives chosen for the explicit
representations, in which circles represent blob-like
primitives and short lines represent edge-like primitives.
Table 7 compares the numbers of blob-like and edge-like
primitives at each scale, counted within the first 50,
100, 150 and 200 chosen primitives respectively. The
percentage of edge-like primitives chosen in the
large-scale frame is much higher than that in the
small-scale frame. Moreover, in the large-scale frame the
blob-like primitives start to appear very late, which
shows that edge-like primitives are much more important
at this scale for representing videos. In the small-scale
frame, by contrast, blob-like primitives make up a
large percentage from the very beginning, and the number
of edge-like primitives grows faster as more primitives
are chosen. This phenomenon demonstrates that blob-like
structures are much more prominent at small scale. This
experiment thus shows that VPS can choose proper
representations automatically and, furthermore, that the
representation patterns may reflect the scale of the videos.
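
To make the comparison concrete, the counts reported in Table 7 can be converted into the fraction of blob-like primitives among the first N chosen primitives. The sketch below assumes that each table entry is ordered as blob-like/edge-like, which is our reading of the table given the discussion above; the helper name `blob_fraction` is illustrative.

```python
# (blob-like, edge-like) counts copied from Table 7.
# ASSUMPTION: each entry "x/y" in the table is blob-like/edge-like.
table7 = {
    1: {50: (0, 50), 100: (0, 100), 150: (1, 149), 200: (6, 194)},
    2: {50: (0, 50), 100: (6, 94), 150: (16, 134), 200: (23, 177)},
    3: {50: (19, 31), 100: (37, 63), 150: (58, 92), 200: (71, 129)},
}


def blob_fraction(counts):
    """Fraction of blob-like primitives in a (blob, edge) count pair."""
    blob, edge = counts
    return blob / (blob + edge)


for scale, row in table7.items():
    fractions = ", ".join(f"{blob_fraction(c):.2f}" for c in row.values())
    print(f"scale {scale}: blob fraction (first 50/100/150/200) = {fractions}")
```

Read this way, scale 3 has the largest share of blob-like primitives at every cutoff, while scale 1 is dominated by edge-like primitives.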


Table 7 Comparison between the numbers of blob-like
and edge-like primitives (blob-like/edge-like) at 3 scales.
For each scale, the numbers are compared within the first
50, 100, 150 and 200 primitives respectively.


Scale   First 50   First 100   First 150   First 200
1       0/50       0/100       1/149       6/194
2       0/50       6/94        16/134      23/177
3       19/31      37/63       58/92       71/129



Fig. 20 Representation switches triggered by scale. Row 1:
observed frames; Row 2: trackability maps; Row 3:
synthesized frames.



Fig. 21 Representation types in different scale video frames, 
where circles represent blob-like type and short lines represent 
edge-like type. 


3.11 VPS supporting action representation 


VPS is also compatible with high-level action
representation. By grouping meaningful explicit parts in a
principled way, it represents an action template. In Fig. 22,
(b) is the action template given by the deformable
action template model (Yao and Zhu (2009)) from the
video shown in (a). The action template essentially
consists of the sketches from the explicit regions. (c)
shows an action synthesis using only filters from a
matching pursuit process. In (d), following the VPS model,
the action parts and a few sketchable background regions
are reconstructed by the explicit representation, and the
large region of water is synthesized by the implicit
representation; thus we get the synthesis of the whole
video. Here, the explicit regions correspond to meaningful
“template” parts, while the implicit regions are auxiliary
background parts.
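
For readers unfamiliar with matching pursuit, the following is a minimal generic sketch of the greedy selection step on a one-dimensional toy signal. It is not the filter bank or the implementation used for the action synthesis here; the dictionary atoms, the signal and the function names are all hypothetical and chosen only to keep the example small.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy matching pursuit; dictionary atoms are assumed unit-norm.

    Returns the selected (atom index, coefficient) pairs and the residual.
    """
    residual = list(signal)
    selected = []
    for _ in range(n_atoms):
        # Response of every atom to the current residual.
        responses = [dot(residual, atom) for atom in dictionary]
        # Pick the atom with the largest absolute response.
        best = max(range(len(dictionary)), key=lambda i: abs(responses[i]))
        coeff = responses[best]
        selected.append((best, coeff))
        # Remove the explained component from the residual.
        residual = [r - coeff * a for r, a in zip(residual, dictionary[best])]
    return selected, residual


# Hypothetical toy dictionary (two "blob-like" and two "edge-like" atoms).
atoms = [normalize(a) for a in
         ([1, 1, 0, 0], [0, 0, 1, 1], [1, -1, 0, 0], [0, 0, 1, -1])]
signal = [3.0, 3.0, 0.5, -0.5]

selected, residual = matching_pursuit(signal, atoms, n_atoms=2)
print("selected atoms:", [index for index, _ in selected])
print("residual energy:", dot(residual, residual))
```

Each iteration keeps the single atom that best explains what is left of the signal, which is why a small number of selected filters can already outline the dominant structure.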

In order to show the relationship between the VPS
representation and effective high-level features, we take a
KTH video (Schuldt et al (2004)) as an example. Fig. 23
and Fig. 24 show the spatial and temporal features of
explicit regions respectively. In Fig. 23 we compare the VPS
spatial descriptor with the well-known HOG feature (Dalal
and Triggs (2005)), which has been widely used for
object representation recently. (b) is the HOG descriptor
for the human in one video frame (a). (c) shows the
structural features extracted by VPS, where circles and short
edges represent 53 local descriptors. Compared with
HOG in (b), VPS makes a local decision on each area
based on statistics of filter responses, and therefore
provides a shorter coding length than HOG. Furthermore, it
gives a more precise description than HOG; e.g. the head
part is represented by a circle descriptor, which contains
more information than a pure filter response histogram
such as HOG. (d) gives a synthesis with the corresponding
filters, which shows the human boundary precisely.


Fig. 22 Action representation by VPS. (a) The input video.
(b) Action template obtained by the deformable action
template (Yao and Zhu (2009)). (c) Action synthesis by
filters. (d) Video synthesis by VPS.

In Fig. 24 we show the motion information between
two consecutive frames (a) and (b) extracted by
MA-FRAME in VPS. (d) gives the clustered motion styles
in the current video. The motion statistics of the five
styles are shown in (e) respectively. Region 1 represents
the area of the head, which is almost still
in the waving motion, while region 5 covers the two arms,
which show a definite moving direction. Region 3
represents the legs, which form an oriented trackable
area. Regions 2 and 4 are relatively ambiguous in motion
direction and are basically textured background in
the video. Given the trackability map shown in
(c), which is based on these motion styles, the motion
template pops up.
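
One simple way to quantify how "definite" or "ambiguous" the motion direction of a region is would be the entropy of its direction histogram. The sketch below is only an illustration of that idea under our own assumptions: it is not the MA-FRAME statistic itself, and the velocity vectors, the bin count and the function names are hypothetical.

```python
import math


def direction_histogram(velocities, n_bins=8):
    """Normalized histogram of motion directions; zero vectors are skipped."""
    counts = [0] * n_bins
    for vx, vy in velocities:
        if vx == 0 and vy == 0:
            continue  # a still pixel has no direction
        angle = math.atan2(vy, vx) % (2 * math.pi)
        counts[int(angle / (2 * math.pi) * n_bins) % n_bins] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def entropy(hist):
    """Shannon entropy in bits of a normalized histogram."""
    return -sum(p * math.log2(p) for p in hist if p > 0)


# Hypothetical velocity samples (NOT measured from the video).
# "arm-like": all vectors point in roughly the same direction.
arm_like = [(1.0, 0.1), (0.9, 0.0), (1.1, 0.2), (1.0, 0.05)]
# "texture-like": directions are spread over the whole circle.
texture_like = [(2, 1), (1, 2), (-1, 2), (-2, 1),
                (-2, -1), (-1, -2), (1, -2), (2, -1)]

arm_entropy = entropy(direction_histogram(arm_like))
texture_entropy = entropy(direction_histogram(texture_like))
print("arm-like entropy (bits):", arm_entropy)
print("texture-like entropy (bits):", texture_entropy)
```

A region with a single dominant direction yields low entropy, whereas a textured background whose directions are spread evenly yields entropy close to the maximum of log2(n_bins).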

In summary, the information extracted by VPS is
compatible with high-level object and motion
representations. In particular, it is very close to the HOG
and HOOF descriptors, which are proven effective spatial
and temporal features respectively. The main difference is
that VPS makes a local decision to give a more compact
expression that is better for visualization. Therefore, VPS
not only gives a middle-level representation for video,
but also has a strong connection with low-level vision
features and high-level vision templates.



Fig. 23 Structural information extracted by HOG and VPS.
(a) The input video frame. (b) HOG descriptor. (c) VPS
feature. (d) Boundary synthesis by filters.



Fig. 24 Motion statistics by VPS. (a) and (b) Two
consecutive video frames of waving hands. (c) Trackability
map. (d) Clustered motion style regions. (e) Corresponding
motion statistics of each region.


4 Discussion and Conclusion 


In this paper, we present a novel video primal sketch
model as a middle-level generic representation of video.
It is generative and parsimonious, integrating a sparse
coding model for explicitly representing sketchable and
trackable regions and extending the FRAME models for
implicitly representing textured motions. It is a video
extension of the primal sketch model (Guo et al (2007)),
and it can choose appropriate models automatically for
video representation.

Based on the model, we design an effective algorithm
for video synthesis, in which explicit regions
are reconstructed by learned video primitives and
implicit regions are synthesized through a Gibbs sampling
procedure based on spatio-temporal statistics. Our
experiments show that VPS is capable of video modeling
and representation with a high compression
ratio and synthesis quality. Furthermore, it learns
explicit and implicit expressions for meaningful low-level
vision features and is compatible with high-level
structural and motion representations, and therefore
provides a unified video representation for low-, middle-
and high-level vision tasks.

In ongoing work, we will strengthen our work in
several aspects, especially by enhancing the connections
with low-level and high-level vision tasks. For low-level
study, we are learning a much richer and more
comprehensive dictionary Δ_B of video primitives. For
high-level application, we are applying the VPS features
to object and action representation and recognition.